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Abstract 

Variable selection is a challenging issue in statistical applications when the number 
of predictors p far exceeds the number of observations n. In this ultra-high dimensional 
setting, the sure independence screening (SIS) procedure was introduced to significantly 
reduce the dimensionality by preserving the true model with overwhelming probability, 
before a refined second stage analysis. However, the aforementioned sure screening 
property strongly relies on the assumption that the important variables in the model 
have large marginal correlations with the response, which rarely holds in reality. To 
overcome this, we propose a novel and simple screening technique called the high¬ 
dimensional ordinary least-squares projection (HOLP). We show that HOLP possesses 
the sure screening property and gives consistent variable selection without the strong 
correlation assumption, and has a low computational complexity. A ridge type HOLP 
procedure is also discussed. Simulation study shows that HOLP performs competitively 
compared to many other marginal correlation based methods. An application to a 
mammalian eye disease data illustrates the attractiveness of HOLP. 

Keywords: Consistency; Forward regression; Generalized inverse; High dimensionality; Lasso; 
Marginal correlation; Moore-Penrose inverse; Ordinary least squares; Sure independent 
screening; Variable selection. 


1 Introduction 


The rapid advances of information technology have brought an unprecedented array of large 

and complex data. In this big data era, a dehning feature of a high dimensional dataset 
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is that the number of variables p far exceeds the number of observations n. As a result, 
the classical ordinary least-squares estimate (OLS) used for linear regression is no longer 
applicable due to a lack of sufficient degrees of freedom. 

Recent years have witnessed an explosion in developing approaches for handling large 
dimensional data sets. A common assumption underlying these approaches is that although 
the data dimension is high, the number of the variables that affect the response is relatively 
small. The hrst class of approaches aim at estimating the parameters and conducting variable 
selection simultaneously by penalizing a loss function via a sparsity inducing penalty. See, 
for example, the Lasso (|TibshiranI 1996; Zhao and Yu, 2006; Meinshausen and Biihlmann 


2008 

), the SCAD ( 

Fan and Li, 

2001 

), the adaptive Lasso ( 

Zou 

2006; 

Wang, et ah, 

2007 


Zhang and Lu, 2007), the grouped Lasso (Yuan and Lin, 2006), the LSA estimator (Wang 


and Len^ 2007), the Dantzig selector (Candes and Tao, 2007), the bridge regression (Huang, 


et al.| 2008), and the elastic net (Zou and Hastie, 2005 Zou and Zhang, 2009). However, 


accurate estimation of a discrete structure is notoriously difficult. For example, the Lasso can 
give non-consistent models if the irrepresentable condition on the design matrix is violated 


(Zhao and Yu, 2006 Zou, 2006), although computationally more extensive methods such as 


those combining subsampling and structure selection (Meinshausen and Biihlmann, 2010 


Shah and Samworth, 2013) may overcome this. 


In ultra-high dimensional cases where p is much larger than n, these penalized approaches 
may not work, and the computation cost for large-scale optimization becomes a concern. It 
is desirable if we can rapidly reduce the large dimensionality before conducting a rehned 


analysis. Motivated by these concerns. Fan and Lv (2008) initiated a second class of ap¬ 
proaches aiming to reduce the dimensionality quickly to a manageable size. In particular, 
they introduce the sure independence screening (SIS) procedure that can signihcantly reduce 
the dimensionality while preserving the true model with an overwhelming probability. This 
important property, termed the sure screening property, plays a pivotal role for the success 
of SIS. The screening operation has been extended, for example, to generalized linear models 


Fan and Fan 

2008 

Fan, et ah, 

2(1(1') 

Fan and Song 

2010 

), additive models ( 

Fan, et al. 


2011), hazard regression (Zhao and Li, 2012; Gorst-Rasmussen and Scheike, 2013), and to 


accommodate conditional correlation (Barut et ah, 2012). As the SIS builds on marginal 
correlations between the response and the features, various extensions of correlation have 


been proposed to deal with more general cases (Hall and Miller, 2009; Zhu, et ah, 2011 Li, 


Zhong, et ah, 2012 Li, Peng, et ah, 2012). A number of papers have proposed alternative 


ways to improve the marginal correlation aspect of screening, see, for example. Hall, et ah 


(2009); Wang (2009, 2012); Clio and Fryzlewicz (2012). 
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There are two important considerations in designing a screening operator. One pinnacle 
consideration is the low computational requirement. After all, screening is predominantly 
used to quickly reduce the dimensionality. The other is that the resulting estimator must 
possess the sure screening property under reasonable assumptions. Otherwise, the very 
purpose of variable screening is defeated. SIS operates by evaluating the correlations between 
the response and one predictor at a time, and retaining the features with top correlations. 
Clearly, this estimator can be much more efficiently and easily calculated than large-scale 


optimization. For the sure screening property, a sufficient condition made for SIS (Fan and 


Lv, 2008) is that the marginal correlations for the important variables must be bounded 


away from zero. This condition is referred to as the marginal correlation condition hereafter. 
However, for high dimensional data sets, this assumption is often violated, as predictors are 
often correlated. As a result, unimportant variables that are highly correlated to important 
predictors will have high priority of being selected. On the other hand, important variables 
that are jointly correlated to the response can be screened out, simply because they are 


marginally uncorrelated to the response. Due to these reasons. Fan and Lv (2008) put 
forward an iterative SIS procedure that repeatedly applies SIS to the current residual in 


hnite many steps. Wang (2009) proved that the classical forward regression can also be used 


for variable screening, and Cho and Fryzlewicz (2012) advocates a tilting procedure. 


In this paper, we propose a novel variable screener named High-dimensional Ordinary 
Least-squares Projection (HOLP), motivated by the ordinary least-squares estimator and the 
ridge regression. Like SIS, the resulting HOLP is straightforward and efficient to compute. 
Unlike SIS, we show that the sure screening property holds without the restrictive marginal 
correlation assumption. We also discussed Ridge-HOLP, a ridge regression version of HOLP. 
Theoretically, we prove that the HOLP and Ridge-HOLP possess the sure screening property. 
More interestingly, we show that both HOLP and Ridge-HOLP are screening consistent in 
that if we retain a model with the same size as the true model, then the retained model is 
the same as the true model with probability tending to one. We illustrate the performance 
of our proposed methods via extensive simulation studies. 


The rest of the paper is organized as follows. We elaborate the HOLP estimator and 
discuss two viewpoints to motivate it in Section 2. The theoretical properties of HOLP and 
its ridge version are presented in Section 3. In Section 4, we use extensive simulation study 
to compare the HOLP estimator with a number of competitors and highlight its competitive¬ 
ness. An analysis of data conhrms its usefulness. Section 5 presents the concluding remarks 
and discusses future research. All the proofs are found in the Supplementary Materials. 
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2 High-dimensional Ordinary Least-Squares Projection 


2.1 A new screening method 

Consider the familiar linear regression model 


y = (3ixi + (32X2 H-h l3pXp + 


where x = (xi, • • • ■,XpY' is the random predictor vector, £ is the random error and y is the 
response. Alternatively, with n realizations of x and y, we can write the model as 

F = X/3 + e, 

where Y G i?" is the response vector, X G is the design matrix, and e G i?" consists 

of i.i.d. errors. Without loss of generality, we assume that e* follows a distribution with 
mean 0 and variance Furthermore, we assume that X'^X is invertible when p < n and 
that XX'^ is invertible when p > n. Denote Xi = {xi, ...,Xp} as the full model and Xis as 
the true model where S = {j : /3j 7^ 0, j = 1, • • ■ ,p} is the index set of the nonzero P/s 
with cardinality s = [S'!. To motivate our method, we hrst look at a general class of linear 
estimates of [3 as 

13 = AY, 

where A G Rp^"^ maps the response to an estimate and the SIS sets A = X'^. Since our 
emphasis is for screening out the important variables, (3 as an estimate of [3 needs not be 
accurate but ideally it maintains the rank order of the entries of |/3| such that the nonzero 
entries of (3 are large in f3 relatively. Note that 

AY = A(X/3 + e) = (AX)/3 + Ae, 

where Ae consists of linear combinations of zero mean random noises and {AX)[3 is the 
signal. In order to preserve the signal part as much as possible, an ideal choice of A would 
satisfy that AX = I. If this choice is possible, the signal part would dominate the noise part 
Ae under suitable conditions. This argument leads naturally to the ordinary least-squares 
estimate where A = {X'^X)~^X'^ when p < n. 

However, when p is large than n, X'^X is degenerate and AX cannot be an identity 
matrix. This fact motivates us to use some kind of generalized inverse of X. In Part A of 
the Supplementary Materials we show that {X'^X)~^X'^ can be seen as the Moore-Penrose 
inverse of X for p < n and that is the Moore-Penrose inverse of X when p > n. 
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We remark that the Moore-Penrose inverse is one particular form of the generalized inverse 
of a matrix. When A = AX is no longer an identity matrix. Nevertheless, 

as long as AX is diagonally dominant, j3i {i G S) can take advantage of the large diagonal 
terms of AX to dominate /3j {i ^ S) that is just a linear combination of off-diagonal terms. 
To show the diagonal dominance of AX = X'^{XX'^)~^X, we quickly present a comparison 
to SIS. In Fig we plot AX for one simulated data set with (n,p) = (50,1000), where X 
is drawn from A^(0, S) with S satisfying one of the following: (i) S = Jp, (ii) cTjj = 0.6 and 
an = 1, (iii) aij = and (iv) aij = 0.995l*“-^L 



Figure 1: Heatmaps for AX = X'^X in SIS (top) and AX = X^(XX^) ^X for the proposed 
method (bottom). 


We see a clear pattern of diagonal dominance for X'^{XX'^)~^X under different scenarios, 
while the diagonal dominance pattern only emerges for AX = X'^X in some structures. To 
provide an analytical insight, we write X via singular value decomposition as X = VDU^, 
where 1/ is an n x n orthogonal matrix, H is an n x n diagonal matrix and f/ is an p x n 
matrix that belongs to the Stiefel manifold W,p- See Part B of the Supplementary Materials 
for details. Then 


X^{XX^)-^X = UU^, X^X = UD^U^. 

Intuitively, X'^{XX'^)~^X reduces the impact from the high correlation of X by removing 
the random diagonal matrix D. As further proved in Part C of the Supplementary Materials, 
UU'^ will be diagonal dominating with overwhelming probability. 
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These discussions lead to a very simple screening method by hrst computing 

/3 = X^{XX'^)-^Y. ( 1 ) 

We name this estimator /3 the High-dimensional Ordinary Least-squares Projection (HOLP) 
due to the similarity to the classical ordinary least-squares estimate. For variable screening, 
we follow a very simple strategy by ranking the components of /3 and selecting the largest 
ones. More precisely, let d be the number of the predictors retained after screening. We 
choose a submodel as 

Md = {xj : \ j3j\ are among the largest d of all |/3jj’s} or = {xj : |/3jj > 7 } 

for some 7 . To see why the HOLP is a projection, we can easily see that 

/3 = X^(XX^)-^X(] + X^{XX^)-h, 


where the hrst term indicates that this estimator can be seen as a projection of jS. However, 
this projection is distinctively different from the usual OLS projection: Whilst the OLS 
projects the response Y onto the column space of X, HOLP uses the row space of X to cap¬ 
ture jd. We note that many other screening methods, such as tilting and forward regression, 
also project Y onto the column space of X. Another important difference between these 
two projections is the screening mechanism. HOLP gives a diagonally dominant projection 
matrix such that the product of this matrix and (3 would be more likely 

to preserve the rank order of the entries in (3. In contrast, tilting and forward regression 
both rely on some goodness-of-ht measure of the selected variables, aiming to minimize the 
distance between htted Y and Y. An important feature of HOLP is that the matrix XX^ 
is of full rank whenever n < p, in marked contrast to the OLS that is degenerate whenever 
n < p. Thus, HOLP is unique to high dimensional data analysis from this standpoint. 

We now motivate HOLP from a different perspective. Recall the ridge regression estimate 


/3(r) = (rJ + X'X)-W^X, 


Fan and Lv 


where r is the ridge parameter. By letting r —)• 00 , it is seen that r/3{r) —)■ X^Y. 

(2008) proposed SIS that retains the large components in X^X as a way to screen variables. 


If we let r —)■ 0, the ridge estimator /3{r) becomes 


(X^X)+X^X, 
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where A~^ denotes the Moore-Penrose generalized inverse. An application of the Sherman- 
Morrison-Woodbury formula in Part A of the Supplementary Materials gives 

(rJ + X^Xy^X^Y = X^{rl + XX^y^Y. 


Then letting r —)■ 0 gives 


{x^xyx^Y = x^{xxy-% 


the HOLP estimator in Q. Therefore, the HOLP estimator can be seen as the other extreme 
of the ridge regression estimator by letting r —)■ 0, as opposed to the marginal screening 
operator in Fan and Lv (2008) by letting r —)■ oo. In real data analysis where X and 

Y are often centered (denoted by X and Y), the ridge version of HOLP X'^{rl + XX'^)~^Y 
is the correct estimator to use as is now rank-degenerate. Theory on the ridge-HOLP 

is studied in next section and comparisons with HOLP are provided in the conclusion. 


Clearly, HOLP is easy to implement and can be efficiently computed. Its computational 
complexity is (9(n^p), while SIS is 0{np). In the ultra-high dimensional cases where rf 
for any c, the computational complexity of HOLP is only slightly worse than that of SIS. 
Another advantage of HOLP is its scale invariance in the signal part X^(XX^)“^X/3. In 
contrast, SIS is not scale-invariant in X^X/d and its performance may be affected by how 
the variables are scaled. 


3 Asymptotic Properties 

3.1 Conditions and assumptions 

Recall the linear model 

y = /32X2 + ■ ■ ■ + l 3 pXp -f- £, 

where x = (xi, ■ ■ ■ ,Xpy is the random predictor vector, e is the random error and y is the 
response. In this paper, X denotes the design matrix. Define Z and z respectively as 

where E = cov{x) is the covariance matrix of the predictors. For simplicity, we assume Xj’s 
to have mean 0 and standard deviation 1, i.e, E is the correlation matrix. It is easy to see 
that the covariance matrix of z is an identity matrix. The tail behavior of the random error 
has a significant impact on the screening performance. To capture that in a general form. 
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we present the following tail condition as a characterization of different distribution families 


studied in Vershynin (2010). 


Definition 3.1. (g-exponential tail condition) A zero mean distribution F is said to 
have a q-exponential tail, if any N > 1 independent random variables ei ^ F satisfy that for 
any a G with ||a ||2 = 1, the following inequality holds 


N 


Z=1 


FI I ^a^eil > t) < exp(l - q{t)) 


for any t > 0 and some function g(-). 


For example, if e* ~ iV(0,1), then Xlili ~ -^(0, !)• With the classical bound on the 
Gaussian tail, one can show that the Gaussian distribution admits a square-exponential tail 
in that q{t) = t‘^/2. 


This characterization of the tail behavior is an analog of Proposition 5.10 and 5.16 in|V( 


er- 


shynin (2010) and is very general. As shown in Vershynin (2010), we have q{t) = 0{F/ 


for some constant K depending on F if F is sub-Gaussian including Gaussian, Bernoulli, 
and any bounded random variables. And we have q{t) = (9(min{f/iF, if F is sub¬ 

exponential including exponential, Poisson and distribution. Moreover, as shown in 


Zhao 


and Yu (2006), any random variable satishes q{t) = 2k\ogt + 0(1) if it has bounded 2k^^ 


moments for some positive integer k. 


Throughout this paper, c* and Oj in various places are used to denote positive constants 
independent of the sample size and the dimensionality. We make the following assumptions. 


Al. The transformed z has a spherically symmetric distribution and there exist some Ci > 1 
and Oi > 0 such that 

P^Xmax{P~^ZZ'^) > Cl or \,nin{p~^ZZ^) < , 

where \max{-) and \min{-) are the largest and smallest eigenvalues of a matrix respec¬ 
tively. Assume p > Cqu for some Cq > 1. 

A 2 . The random error £ has mean zero and standard deviation a, and is independent of x. 
The standardized error efa has g-exponential tails with some function g(-). 

A3. We assume that var{y) = 0(1) and that for some k> 0, z/>0,r>0 and C 2 , C 3 , C 4 > 0, 

min|/3j| > — s < and cond(S) < CifF, 


















where cond(S) = XmaxiJ ^)/is the conditional number of S. 


The assumptions are similar to those in Fan and Lv (2008) with a key difference. The strong 
condition on the marginal correlation between y and those Xj with j E S required by SIS to 
satisfy 


min|con(/3- y,Xj)\>C 5 
jes ■' 


( 2 ) 


for some constant C 5 , is no longer needed for HOLP. This marginal correlation condition, 


as pointed out by Fan and Lv (2008), can be easily violated if variables are correlated. 


Assumption Al is similar to but weaker than the concentration property in Fan and Lv 


(2008). See also Bai (1999). They require all the submatrices of Z consisting of more than 


cn rows for some positive c to satisfy this eigenvalue concentration inequality, while here we 


only require Z itself to hold. The proof in Fan and Lv (2008) can be directly applied to show 


that Al is true for the Gaussian distribution, and the results in Section 5.4 of Vershynin 


(2010) show that the deviation inequality is also true for any sub-Gaussian distribution. 


It becomes clear later in the proof that the inequality in Al is not a critical condition for 
variable screening. In fact, it can be excluded if the model is nearly noiseless. In A3, k 
controls the speed at which nonzero /?/s decay to 0 , z/ is the sparsity rate, and r controls 
the singularity of the covariance matrix. 


3.2 Main theorems 


We establish the important properties of HOLP by presenting three theorems. 
Theorem 1. (Screening property) Assume that A1-A3 hold. If we choose 7 „ such that 


Pin ^ , p'jnV'^ogn 


1 — r — K 


-E 0 and 


n 


n 


1 — T — K, 


OC, 


( 3 ) 


then for the same Ci specified in Assumption Al, we have 


P{MsCM,\ =1-0 exp -G 


n 




21 ogn 




n 


Note that we do not make any assumption on p in Theorem as long as p > CqU, 
allowing p to grow even faster than the exponential rate of the sample size commonly seen 
in the literature. The result in Theorem [T] can be of independent interest. If we specialize 
the dimension to ultra-high dimensional problems, we have the following strong results. 

Theorem 2. (Screening consistency) In addition to the assumptions in Theorem S if p 
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further satisfies 


^l-2«-5r /y^^l/2-2r- 


( 4 ) 


then for the same 7 ^ defined in Theorem^ and the same Ci specified in Al, we have 


FI min 1/3,1 > 7 „ > max|;5,| ) 


= 1 - 0<^ exp -C 


n 


l — 2K—bT — U 


+ exp 1 - -g 


1 


—2t—k, 


2^1 71^ 


n 


21ogn 

Alternatively, we can choose a submodel Aid with d>^ for some l G (z/, 1] such that 


P[ Ais C Aid =1 — 0{ exp 


C*! 


n 


1—2«:—5t— z/ 


2 logn 


+ exp I 1 — -g 


1 /7^nV2-2r- 


2 *^V 


n 


The first part of Theorem [^states that if the number of predictors satishes the condition, 
the important and unimportant variables are separable by simply thresholding the estimated 
coefficients in fi. The second part simply states that as long as we choose a submodel with 
a dimension larger than that of the true model, we are guaranteed to choose a superset of 
the variables that contains the true model with probability close to one. If we choose d = s, 
then HOLP indeed selects the true model with an overwhelming probability. This result 
seems surprising at hrst glance. It is, however, much weaker than the consistency of the 


Lasso under the irrepresentable condition (Zhao and Yu, 2006), as the latter gives parameter 


estimation and variable selection at the same time, while our screening procedure is only 
used for pre-selecting variables. 

When the error e follows a sub-Gaussian distribution, HOLP can achieve screening con¬ 
sistency when the number of covariates increases exponentially with the sample size. 

Corollary 1 . (Screening consistency for sub-Gaussian errors) Assume A1-A3. If the stan¬ 
dardized error follows a sub-Gaussian distribution, i.e., q(t) = 0{tfi/K‘^) where K is some 
constant depending on the distribution, then the condition on p becomes 


logp = o 


n 


1-2k-5t 


logn 


and for the same 7 „ defined in Theorem^ we have 


F min 1/3) > 7 „ > max 1/3) = 1 - 0<^ exp - Ci- 

i&s i^s J [ \ 2 log n 


n 


l — 2K — bT—U 
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and with n'' for some i G (i^, 1], we have 

The next result is an extension of HOLP to the ridge regression. Recall the ridge regres¬ 
sion estimate 


/3(r) = {X^X + rlpY^X^Y = X^{XX^ + rQ-^Y. 


By controlling the diverging rate of r, a similar screening property as in Theorem holds 
for the ridge regression estimate. 

Theorem 3. (Screening consistency for ridge regression) Assume A1-A3 and thatp satisfies 
Q. If the tuning parameter r satisfies r = and in addition to 'jn further 

satisfies that —)■ oo, then for the same Ci in Al, we have 


P[ min|/3i(r)| > 7 ^ > max |/3i(r)| 

i&S i^S 


= 1 — 0\ exp ( — Cl 


n 


l—^r—2K,—u 


2 logn 


exp 1 - -g 


1 /VCinV2-2r- 


2 *^V 


n 


With d X n‘ for some l G (z/, 1] we have 


P{ Ms C Md = 1 - exp -Cl 


n 


1—2/^—5r—h* 


2 logn 


exp 


1 

2 ^ \ ^/logn 


In particular, for any fixed positive constant r, the above results hold. 


Theorem shows that ridge regression can also be used for screening variables. We 
recommended to use ridge regression for screening when is close to degeneracy or 

when n ~ p. Otherwise, HOLP is suggested due to its simplicity as it is tuning free. It is 
also easy to see that the ridge regression estimate has the same computational complexity 
as the HOLP estimator. A ridge regression estimator also provides potential for extending 
the HOLP screening procedure to models other than in linear regression. 


One practical issue for variable screening is how to determine the size of the submodel. 
As shown in the theory, as long as the size of the submodel is larger than the true model, 
HOLP preserves the non-zero predictors with an overwhelming probability. Thus, if we can 
assume s x for some z/ < 1, we can choose a submodel with size n, n — 1 or n/ logn (Fan 


and Lv, 2008 Li, Peng, et al., 2012), or using techniques such as extended BIG (Chen and 


Chen, 2008) to determine the submodel size (Wang, 2009). For simplicity, we mainly use n 
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as the submodel size in numerical study, with some exploration on the extended BIC. 


4 Numerical Studies 

In this section, we provide extensive numerical experiments to evaluate the performance of 
HOLP. The structure of this section is organized as follows. In Part 1, we compare the 


screening accuracy of HOLP to that of (I) SIS in Fan and Lv (2008), robust rank correlation 
based screening (RRCS, Li, et ah 2012), the forward regression (FR, Wang, 2009), and the 


tilting (Cho and Fryzlewicz, 2012). In Part 2, Theorem 2 and 3 are numerically assessed 


under various setups. Because computational complexity is key to a successful screening, in 
Part 3, we document the computational time of various methods. Finally, we evaluate the 
impact of screening by comparing two-stage procedures where penalized likelihood methods 
are employed after screening in Part 4. For implementation, we make use of the existing R 
package “SIS” and “tilting”, and write our own code in R for forward regression. 

Although not presented, we have evaluated two additional screeners. The hrst is the 
Ridge-HOLP by setting r = 10. We found that the performance is similar to HOLP and 


therefore report its result only for Part 2. Motivated by the iterative SIS of Fan and Lv 


(2008), we also investigated an iterative version of HOLP by adding the variable correspond¬ 


ing to the largest entry in HOLP, one at a time, to the chosen model. In most cases studied, 
the screening accuracy of Iterative-HOLP is similar to or slightly better than HOLP but the 
computational cost is much higher. As computation efficiency is one crucial consideration 
and also due to the space limit, we decide not to include the results. 


4.1 Simulation study I: Screening accuracy 

For simulation study, we set (p, n) = (1000,100) or (p, n) = (10000,200) and let the random 
error follow iV(0,cr^) with cr^ adjusted to achieve different theoretical values dehned as 


^ var{x^/3)/var{y) (Wang, 2009). We use either = 50% for low or R^ = 90% for 


high signal-to-noise ratio. We simulate covariates from multivariate normal distributions 
with mean zero and specify the covariance matrix as the following six models. For each 
simulation setup, 200 datasets are used for p = 1000 and 100 datasets are for p = 10000. 
We report the probability of including the true model by selecting a sub-model of size n. No 
results are reported for tilting when (p, n) = (10000, 200) due to its immense computational 
cost. 


(i) Independent predictors. This example is from Fan and Lv (2008) and Wang (2009) 
with S = {1,2, 3,4, 5}. We generate Xj from a standard multivariate normal distribution 
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with independent components. The coefficients are specihed as 


= (—1)“*(|A^(0,1)1 + Alogn/y/n), where Ui ~ Ber{0A) for i G S' and /d* = 0 for i ^ S. 


(ii) Compound symmetry. This example is from Example I in Fan and Lv (2008) and 


Example 3 in Wang (2009), where all predictors are equally correlated with correlation p, 


and we set p = 0.3, 0.6 or 0.9. The coefficients are set to be /?* = 5 for i = 1,..., 5 and A = 0 
otherwise. 

(iii) Autoregressive correlation. This correlation structure arises when the predictors 
are naturally ordered, for example in time series. The example used here is Example 2 in 
Wang (2009), modihed from the original example in Tibshirani ( 1996| . More specifically, each 
Xi follows a multivariate normal distribution, with cov{xi,Xj) = where p = 0.3, 0.6, 

or 0.9. The coefficients are specihed as 


Pi = 3, /34 = 1.5, Pj = 2, and Pi = 0 otherwise. 


(iv) Factor models. Factor models are useful for dimension reduction. Our example 


is taken from Meinshausen and Biihlmann (2010) and Cho and Fryzlewicz (2012). Let 
= 1,2,-- - ,A; be independent standard normal random variables. We set predictors 
as Xi = ^jfij + where fij and rji are generated from independent standard normal 
distributions. The number of the factors is chosen as k = 2 ,10 or 20 in the simulation while 
the coefficients are specihed the same as in Example (ii). 

(v) Group structure. Group structures depict a special correlation pattern. This example 


is similar to Example 4 of Zou and Hastie (2005), for which we allocate the 15 true variables 


into three groups. Specihcally, the predictors are generated as 


Xl+3m — Zi + N{0, 5^), X2+3m — ^2 + -^( 0 , h^), ^3+3^ — Z3 + N{0, S^), 
where m = 0,1,2, 3,4 and Zi ~ X(0, 1) are independent. The parameter controlling the 


strength of the group structure is hxed at 0.01 as in Zou and Hastie (2005), 0.05 or 0.1 for 


a more comprehensive evaluation. The coefficients are set as 


Pi = 3, z = 1,2,-- - ,15; Pi = 0, i = ,p. 


(vi) Extreme correlation. We generate this example to illustrate the performance of 


HOLP in extreme cases motivated by the challenging Example 4 in Wang (2009). As in 


Wang (2009), assuming Zi ~ A^(0,1) and Wi ~ A^(0,1), we generate the important x^s as 
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Xi = {Zi + Wi)/\/2,i = 1,2, ■ ,5 and x* = {zi + J2^j=iWj)/2,i = 16, - ■ ■ ,p. Setting the 
coefficients the same as in Example (ii), one can show that the correlation between the 
response and the trne predictors is no larger than two thirds of that between the response 
and the false predictors. Thus, the response variable is more correlated to a large number 
of unimportant variables. To make the example even more difficult, we assign another two 
unimportant predictors to be highly correlated with each true predictor. Specifically, we let 
Xi+s,Xi+ 2 s = Xi + A^(0,0.01), i = 1,2, ■■■ ,5. As a result, it will be extremely difficult to 
identify any important predictor. 

Table 1: Probability to include the true model when {p,n) = (10000,200) 



Example 


HOLP 

SIS 

RRCS 

ISIS 

FR 

Tilting 



(i) Independent predictors 


0.900 

0.940 

0.890 

0.620 

0.570 





CO 

o 

II 

0.310 

0.310 

0.250 

0.060 

0.020 

— 



(ii) Compound symmetry 

p = 0.6 

0.020 

0.020 

0.010 

0.000 

0.000 

— 




p = 0.9 

0.000 

0.000 

0.000 

0.000 

0.000 

~ 




CO 

o 

II 

0.810 

0.860 

0.760 

0.740 

0.740 

— 



(iii) Autoregressive 

p = 0.6 

1.000 

1.000 

1.000 

0.580 

0.680 

— 


= 50% 


p = 0.9 

1.000 

1.000 

1.000 

0.480 

0.390 

— 


II 

to 

0.450 

0.010 

0.010 

0.020 

0.240 

— 



(iv) Factor models 

/c = 10 

0.050 

0.000 

0.000 

0.000 

0.010 

— 




A: = 20 

0.030 

0.000 

0.000 

0.000 

0.000 

— 




,5^ = 0.1 

1.000 

1.000 

1.000 

0.000 

0.000 

— 



(v) Group structure 

5'^ = 0.05 

1.000 

1.000 

1.000 

0.000 

0.000 

— 




,52 = 0.01 

1.000 

1.000 

1.000 

0.000 

0.000 

— 



(vi) Extreme correlation 


0.580 

0.000 

0.000 

0.000 

0.040 

— 



(i) Independent predictors 


1.000 

1.000 

1.000 

1. 000 

1.000 





CO 

0 

II 

1.000 

0.820 

0.710 

1. 000 

1.000 

— 



(ii) Compound symmetry 

p = 0.6 

0.960 

0.550 

0.320 

0.420 

0.960 

— 




p = 0.9 

0.100 

0.030 

0.000 

0.000 

0.000 

— 




CO 

0 

II 

0.990 

0.990 

0.980 

I.OOO 

1.000 

— 



(iii) Autoregressive 

p = 0.6 

1.000 

1.000 

1.000 

I.OOO 

1.000 

— 

7^2 

= 90% 


II 

0 

1.000 

1.000 

1.000 

I.OOO 

1.000 

— 


II 

0.990 

0.010 

0.020 

0.350 

0.990 

— 



(iv) Factor model 

/c = 10 

0.850 

0.000 

0.000 

0.060 

0.700 

— 




/c = 20 

0.540 

0.000 

0.000 

0.010 

0.230 

— 




,5^ = 0.1 

1.000 

1.000 

1.000 

0.000 

0.000 

— 



(v) Group structure 

^2 = 0.05 

1.000 

1.000 

1.000 

0.000 

0.000 

— 




,52 = 0.01 

1.000 

1.000 

1.000 

0.000 

0.000 

~ 



(vi) Extreme correlation 


1.000 

0.000 

0.000 

0.000 

0.210 

— 


Brief summary of the simulation results 


The results for {p, n) = (1000,100) are shown in Table S.l in the Supplementary Materials 
and those for {p,n) = (10000,200) are in Table We summarize the results in following 
three points. First, when the signal-to-noise ratio is low, HOLP, RRCS and SIS outperform 
ISIS, FR and Tilting in Example (i), (ii), (iii), and (v). For the factor model (iv), neither 
SIS nor RRCS works while HOLP gives the best performance. In addition, HOLP seems 
to be the only effective screening method for the extreme correlation model (vi). The poor 
performance of ISIS, forward regression and tilting in selected scenarios of Example (ii). 
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(iii), and (v) might be caused by the low signal-to-noise ratio, as these methods all depend 
on the marginal residual deviance that is unreliable when the signal is weak. In particular, 
they require each true predictor to give the smallest marginal deviance at some step in order 
to be selected, imposing a strong condition for achieving satisfactory screening results. By 
contrast, SIS, RRCS and HOLP select the sub-model in one step and thus eliminate this 
strong requirement. The poor performance of SIS and RRCS in Example (iv) and (vi) might 
be caused by the violation of marginal correlation assumption Q as discussed before. 

Second, when the signal-to-noise ratio increases to 90%, signihcant improvements are 
seen for all methods. Remarkably, HOLP remains competitive and achieves an overall good 
performance. There are occasions where forward regression and tilting perform slightly better 
than HOLP, most of which, however, involve only relatively simple structures. The superior 
performance of forward regression and tilting under simple structures mainly beneht from 
their one-at-a-step screening strategy and the high signal-to-noise ratio. In the simulation 
study that is not presented here, we also implemented an iterative version of HOLP, which 
achieves a similar performance as forward regression and HOLP in most cases. Yet this 
strategy fails to a large extent for the group-structured correlation in Example (v). 

Another important feature of HOLP, RRCS and (I)SIS is the flexibility in adjusting the 
sub-model size. Unlike forward regression and tilting, no limitation is imposed on the sub¬ 
model size for HOLP, RRCS and (I)SIS. There might be an advantage to choose a sub-model 
of size greater than n, so that a better estimation or prediction accuracy can be achieved. 
For example, in Example (ii) when {p, n, p, = (10000, 200, 0.9, 90%), by selecting 200 co¬ 
variates, HOLP preserves the true model with probability 10%. This probability is improved 
to around 50% if the sub-model size increases to 1000, a ten-fold reduction in dimensionality 
still. In contrast to HOLP, it is impossible for forward regression and tilting to select a 
sub-model of size larger than n due to the lack of degrees of freedom. 

As shown in Section 3, HOLP relaxes the marginal correlation condition (|^ required 
by SIS. We verify this statement by comparing HOLP and SIS in a scenario where some 
important predictors are jointly correlated but marginally uncorrelated with the response. 
We take the setup in Example (ii) with the following model specihcation 

y = bxi + bx 2 + 5x3 + 5x4 — 20 px 5 -+- £. 

It is easy to verify that cox(x 5 , y) = 0, i.e., xs is marginally uncorrelated with y. We simulate 
200 data sets with (p, n) = (1000,100) or (p, n) = (10000,200) with different values of p. 
The probability of including the true model is plotted in Fig We see that HOLP performs 
universally better than SIS for any p. 
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p = 1000, n = 100 


p = 10000, n = 200 




Figure 2: Probability of including the true model for the example where is marginally 
uncorrelated but jointly correlated with y. 

4.2 Simulation study II: Verification of Theorem 2 and 3 

Theorem 2 and 3 state that HOLP and its ridge regression counterpart are able to separate 
the important variables from those unimportant ones with a large probability, and thus 
guarantee the effectiveness of variable screening. In particular, the two theorems indicate 
that by choosing a sub-model of size s, we are guaranteed to exactly select the true model. 
In this study, we revisit the examples in Simulation I by varying s to provide numerical 
evidences for this claim. Since there are multiple setups, for convenience we only look at 
Example (ii), (hi), (iv) and (v) by fixing the parameters at p = 0.5, k = 5, 6“^ = 0.01 for 
= 90% and p = 0.3, k = 2,5“^ = 0.01 for = 50% respectively. Because Example (vi) is 
difficult, in order to demonstrate the two theorems for moderate sample sizes, we relax the 
correlation between the important and unimportant predictors from 0.99 to 0.90 and use a 
different growing speed for the number of parameters for this case. To be precise, we set 


P = 


4 X [exp(n^/^)J for examples except Example (vi) 
20 X [exp(n^'^^)J for Example (vi) 


and 


s = 


1.5 X for R'^ = 90% 

for R^ = 50% 
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where [-J is the floor function. We vary the sample size from 50 to 500 with an increment of 
50 and simulate 50 data sets for each example. The probability that minjg^ |/3j| > max^^^ |/3j| 
is plotted in Figure]^ for HOLP and in Figure [ST] in Part D of the Supplementary Materials 
for the ridge HOLP with r = 10. 


HOLP with R = 90% 


HOLP with H= 50% 
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Figure 3: HOLP: P(minjg 5 |/3j| > maxj ^5 |/3i|) versus the sample size n. 


The increasing trend of the selection probability is explicitly illustrated in Figj^ Although 
not plotted, the probability for example (vi) when = 50% also tends to one if the sample 
size is further increased. Thus, we conclude that the probability of correctly identifying the 
importance rank tends to one as the sample size increases. A rough exponential pattern 
can be recognized from the curves, corresponding to the rate specified in Corollary 1. In 
addition, the probability of identifying the true model is quite similar between HOLP and 
Ridge-HOLP, echoing the statement we made at the beginning of Section 4. 


4.3 Simulation study III: Computation efficiency 

Computation efficiency is a vital concern for variable screening algorithms, as the primary 
motivation of screening is to assist variable selection methods, so that they are scalable to 
large data sets. In this section, we use Example (ii) in Simulation I with p = 0.9, n = 100 
and = 90% to illustrate the computation efficiency of HOLP as compared to SIS, ISIS, 
forward regression, and tilting. In Figure]^ we fix the data dimension at p = 1000, vary the 
select sub-model size from 1 to 100, and record the runtime for each method, while in Figure 
we fix the sub-model size at d = 50 and vary the data dimension p from 50 to 2500. Note 
that the R package ’SIS’ computes X'^Y in an inefficient way. For a fair comparison, we 
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write our own code for computing Because the computation complexity of tilting is 

significantly higher than all other methods, a separate plot excluding tilting is provided for 
each situation. 


p=1000,n=100 


p=1000,n=100 (tilting excluded) 


Tilting 

Forward regression 

iSiS 

HOLP 

SiS 

RRCS 


/ 




~T 

20 


~T 

40 


60 


80 


100 



Forward regression 

-A- 

ISiS 

-m- 

HOLP 

-B- 

SIS 


RRCS 


A-i-*- 




~r~ 

20 


~r~ 

60 


~r“ 

80 


40 


100 


selected submodel size (d) seiected submodei size (d) 

Figure 4: Computational time against the submodel size when (p,n) = (1000,100). 


d=50,n=100 


d=50,n=100 (tilting excluded) 



in 


o 
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Forward regression 

iSiS 

HOLP 

SiS 

-^r- RRCS 





500 1000 1500 2000 2500 


fuii modei size (p) 


Figure 5: Computational time against the total number of the covariates when (d, n) = 
(50,100). 


As can be seen from the figures, HOLP, RRCS and SIS are the three most efficient 
algorithms. RRCS is actually slightly slower than HOLP and SIS, but not significantly. 
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On the other hand, tilting demands the heaviest computational cost, followed by forward 
regression and ISIS. This result can be interpreted as follows. When p is hxed as in Figure 
1^ HOLP, RRCS and SIS only incurs a linear complexity on sub-model size d, whereas the 
complexity of forward regression is approximately quadratic and tilting is 0{k‘^cP + k^d) 


where k is the size of active set (Cho and Fryzlewicz, 2012). When d is hxed as in Figure 
the computational time for all methods other than tilting is linearly increasing on the 
total number of predictors p, while the time for tilting increasing quadratically with p. 
We thus conclude that SIS, RRCS and HOLP are the three preferred methods in terms of 
computational complexity. 


4.4 Simulation study IV: Performance comparison after screening 


Screening as a preselection step aims at assisting the second stage rehned analysis on pa¬ 
rameter estimation and variable selection. To fully investigate the impact of screening on 
the second stage analysis, we evaluate and compare different two-stage procedures where 
screening is followed by variable selection methods such as Lasso or SCAD, as well as these 
one-stage variable selection methods themselves. In this section, we look at the six examples 
in Simulation study I, where the parameters are hxed at p = 0.6, k = 10, = 0.01 and 

= 90%. To choose the tuning parameter in Lasso or SCAD, we make use of the extended 


BIC (Chen and Chen, 2008; Wang, 2009) to determine a hnal model that minimizes 


R,S S d 

EBIC = log- 1 —(logn -I- 21ogp), 

n n 


where d is the number of the predictors in the full model or selected sub-model. For all two- 
stage methods, we hrst choose a sub-model of size n, or use extended BIC to determine the 
sub-model size (only for HOLP-EBICS), and then apply either Lasso or SCAD to the sub¬ 
model to output the hnal result. We compare HOLP-Lasso, HOLP-SCAD, HOLP-EBICS 
(abbreviation for HOLP-EBIC-SCAD) to SIS-SCAD, RRCS-SCAD, ISIS-SCAD, Tilting, 
FR-Lasso, FR-SCAD, as well as Lasso and SCAD. The reason we only apply SCAD to SIS 


and ISIS is that SCAD is shown to achieve the best performance in the original paper (Fan 


and Lv, 2008). 


Finally, the performance is evaluated for each method in terms of the following measure¬ 
ments: the number of false negatives (#FNs, i.e., wrong zeros), the number of false positives 
(#FPs, i.e., wrong predictors), the probability that the selected model contains the true 
model (Coverage), the probability that the selected model is exactly the true model (Exact, 
i.e., no false positives or negatives), the estimation error (denoted as ||/3 — /3||2), the average 
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size of the selected model (Size), and the algorithm’s running time (in seconds per data set). 


As in Simulation study I, we simulate 200 data sets for (p, n) = (1000,100) and 100 data 
sets for (p,n) = (10000,200). There will be no results for tilting in the latter case because 
of the immense computational cost. The results for SIS is provided by the package ’SIS’, 
except for the computing time, which is recorded separately by calculating X'^Y directly as 
discussed before. All the simulations are run in single thread on PC with an 17-3770 CPU, 
where we use the package “glmnet” for the Lasso and “ncvreg” for the SCAD. 


Results of the nine methods are shown in Table S.2 in the Supplementary Materials 
and Table As can be seen, most methods work well for data sets with relatively simple 
structures, for example, the independent and autoregressive correlation structure; likewise, 
most of them fail for complicated ones, for example, the factor model with 10 factors. The 
results can be summarized in four main points. First, HOLP-SCAD achieves the smallest 
or close to the smallest estimation error for most cases. Second, SCAD has the overall best 
coverage probability and the smallest number of false negatives, followed closely by HOLP- 
SCAD and FR-SCAD. One potential caveat is, however, the high false positives for SCAD 
in many cases. Third, using extended BIC to determine the sub-model size can signihcantly 
reduce the false positive rate, although such gain is achieved at the expense of a higher false 
negative rate and a lower coverage probability. It is also worth noting that using extended 
BIC can further speed up two-stage methods. Finally, Lasso, HOLP-Lasso, HOLP-SCAD, 
RRCS-SCAD and SIS-SCAD are the most efficient algorithms in terms of computation. 


The simulation results suggest that HOLP can not only speed up Lasso and SCAD, 
but also maintain or even improve their performance in model selection and estimation. 
In particular, HOLP-SCAD achieves an overal attractive performance. We thus conclude 
that HOLP is an efficient and effective variable screening algorithm in helping down-stream 
analysis for parameter estimation and variable selection. 


4.5 A real data application 


This data set was used to study the mammalian eye diseases by Scheetz et al. (2006) where 


gene expressions on the eye tissues from 120 twelve-week-old male F2 rats were recorded. 
Among the genes under study, of particular interest is a gene coded as TRIM32 responsible 


for causing Bardet-Biedl syndrome (Chiang et ah, 2006) 


Following Scheetz et al. (2006), we choose 18976 probe sets as they exhibited sufficient 


signal for reliable analysis and at least 2-fold variation in expressions. The intensity values of 
these genes are evaluated in the logarithm scale and normalized using the method in|Irizari5^ 


et al. (2003). Because TRIM32 is believed to be only linked to a small number of genes, we 
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Table 2: Model selection results for (p, n) = (10000,200) 


example 


#ENs 

#FPs 

Coverage(%) 

Exact (%) 

Size 

II/3-/3||2 

time (sec) 


Lasso 

0.00 

0.20 

100.0 

78.0 

5.20 

1.21 

1.15 


SCAD 

0.00 

0.00 

100.0 

100.0 

5.00 

0.26 

18.79 

(i) Independent 
predictors 

ISIS-SCAD 

0.00 

0.00 

100.0 

100.0 

5.00 

0.26 

211.6 

SIS-SCAD 

0.04 

0.00 

96.0 

96.0 

4.96 

0.42 

0.88 

RRCS-SCAD 

0.07 

0.00 

93.0 

93.0 

4.93 

0.53 

18.12 

5 = 5,||/3||2 =3.8 

FR-Lasso 

0.00 

0.32 

100.0 

78.0 

5.32 

1.04 

5246.6 

FR-SCAD 

0.00 

0.00 

100.0 

100.0 

5.00 

0.26 

5247.2 


HOLP-Lasso 

0.02 

0.20 

98.0 

78.0 

5.18 

1.21 

0.45 


HOLP-SCAD 

0.02 

0.00 

98.0 

98.0 

4.98 

0.29 

0.97 


HOLP-EBICS 

Tilting 

0.19 

0.00 

82.0 

82.0 

4.81 

0.55 

1.19 


Lasso 

1.56 

2.41 

34.0 

0.0 

5.85 

9.00 

1.51 


SCAD 

0.01 

3.65 

99.0 

6.0 

8.64 

4.10 

251.5 

(ii) Compound 

ISIS-SCAD 

1.20 

5.25 

38.0 

15.0 

9.05 

7.24 

465.1 

SIS-SCAD 

1.51 

6.19 

26.0 

19.0 

9.68 

7.97 

3.84 

symmetry 

RRCS-SCAD 

1.72 

6.27 

22.0 

17.0 

9.55 

8.23 

25.26 

s = 5,||/3||2 = 8.6 

FR-Lasso 

0.14 

6.89 

86.0 

0.0 

11.95 

7.61 

6904.2 

FR-SCAD 

0.20 

3.35 

85.0 

6.0 

8.15 

4.80 

6909.3 


HOLP-Lasso 

1.24 

2.65 

45.0 

4.0 

6.41 

8.55 

0.60 


HOLP-SCAD 

0.04 

3.61 

96.0 

10.0 

8.57 

2.79 

4.30 


HOLP-EBICS 

Tilting 

0.25 

1.22 

77.0 

45.0 

4.97 

3.72 

1.43 


Lasso 

0.00 

1.06 

100.0 

0.0 

4.06 

0.62 

2.41 


SCAD 

0.00 

0.00 

100.0 

100.0 

3.00 

0.16 

34.53 

(iii) Autoregressive 
correlation 

ISIS-SCAD 

0.00 

0.00 

100.0 

100.0 

3.00 

0.16 

342.8 

SIS-SCAD 

0.00 

0.00 

100.0 

100.0 

3.00 

0.16 

1.44 

RRCS-SCAD 

0.00 

0.00 

100.0 

100.0 

3.00 

0.16 

23.13 

5 = 3,||/3||2 = 3.9 

FR-Lasso 

0.00 

1.13 

100.0 

0.0 

4.13 

0.56 

10251.2 

FR-SCAD 

0.00 

0.00 

100.0 

100.0 

3.00 

0.16 

10252.1 


HOLP-Lasso 

0.00 

1.12 

100.0 

0.0 

4.12 

0.60 

1.10 


HOLP-SCAD 

0.00 

0.00 

100.0 

100.0 

3.00 

0.16 

1.78 


HOLP-EBICS 

Tilting 

0.00 

0.00 

100.0 

100.0 

3.00 

0.16 

2.21 


Lasso 

4.79 

6.17 

0.0 

0.0 

6.38 

11.32 

1.46 


SCAD 

0.11 

21.08 

91.0 

4.0 

25.97 

9.41 

76.30 


ISIS-SCAD 

3.09 

18.06 

3.0 

3.0 

19.97 

14.27 

409.8 

(iv) Factor Models 

SIS-SCAD 

4.49 

7.95 

0.0 

0.0 

8.46 

12.45 

3.34 


RRCS-SCAD 

4.47 

8.16 

0.0 

0.0 

8.69 

12.50 

25.80 

5 = 5, ||^||2 = 8.6 

FR-Lasso 

3.54 

4.45 

13.0 

0.0 

5.91 

19.40 

7340.1 


FR-SCAD 

1.12 

21.89 

58.0 

6.0 

25.77 

17.18 

7341.8 


HOLP-Lasso 

3.91 

6.00 

1.0 

0.0 

7.09 

11.36 

0.58 


HOLP-SCAD 

0.54 

14.02 

68.0 

7.0 

18.48 

8.83 

3.00 


HOLP-EBICS 

Tilting 

1.70 

9.30 

25.0 

10.0 

22.60 

10.56 

1.69 


Lasso 

7.82 

0.10 

0.0 

0.0 

7.27 

13.14 

1.51 


SCAD 

11.99 

115.40 

0.0 

0.0 

118.44 

25.22 

65.67 

(v) Group 
structure 

ISIS-SCAD 

12.00 

26.06 

0.0 

0.0 

29.06 

22.70 

490.4 

SIS-SCAD 

11.98 

21.73 

0.0 

0.0 

24.75 

22.68 

2.19 

RRCS-SCAD 

11.98 

21.13 

0.0 

0.0 

24.15 

22.77 

20.13 

5 = 5,||/3||2 = 19.4 

FR-Lasso 

11.75 

0.89 

0.0 

0.0 

4.14 

19.43 

6916.9 

FR-SCAD 

11.96 

21.50 

0.0 

0.0 

24.54 

25.40 

6918.0 


HOLP-Lasso 

7.75 

0.11 

0.0 

0.0 

7.36 

13.14 

0.62 


HOLP-SCAD 

11.98 

21.95 

0.0 

0.0 

24.97 

22.48 

2.46 


HOLP-EBICS 

Tilting 

11.98 

0.92 

0.0 

0.0 

3.94 

23.23 

1.43 


Lasso 

1.06 

11.46 

0.0 

0.0 

15.40 

8.60 

1.34 


SCAD 

0.00 

0.00 

100.0 

100.0 

5.00 

0.54 

105.2 

(vi) Extreme 
correlation 

ISIS-SCAD 

4.97 

3.81 

0.0 

0.0 

3.85 

13.18 

507.4 

SIS-SCAD 

4.93 

2.67 

0.0 

0.0 

2.74 

12.10 

3.55 

RRCS-SCAD 

5.00 

2.70 

0.0 

0.0 

2.70 

12.10 

27.75 

^ = 5,||/3||2 = 8.6 

FR-Lasso 

2.41 

6.32 

3.0 

0.0 

8.89 

10.30 

7317.6 

FR-SCAD 

2.54 

2.54 

3.0 

3.0 

5.00 

11.21 

7319.2 


HOLP-Lasso 

0.89 

10.72 

42.0 

0.0 

14.83 

7.82 

0.43 


HOLP-SCAD 

0.00 

0.00 

100.0 

100.0 

5.00 

0.54 

2.70 


HOLP-EBICS 

Tilting 

0.70 

0.70 

40.0 

40.0 

5.00 

2.17 

1.51 
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confine onr attention to the top 5000 genes with the highest sample variance. For comparison, 
the nine methods in simulation study IV are examined via 10-fold cross validation and the 
selected models are refitted via ordinary least squares for prediction purposes. We report 
the means and the standard errors of the mean square errors for prediction and the final 
chosen model size in Table As a reference, we also report these values for the null model. 

Table 3: The 10-fold cross validation error for nine different methods 


Methods 

Mean errors 

Standard errors 

Final size (median) 

Lasso 

0.011 

0.009 

5 

SCAD 

0.015 

0.011 

4 

ISIS-SCAD 

0.012 

0.006 

4 

SIS-SCAD 

0.010 

0.004 

3 

RRCS-SCAD 

0.010 

0.006 

2 

FR-Lasso 

0.016 

0.019 

4 

FR-SCAD 

0.014 

0.014 

3 

HOLP-Lasso 

0.012 

0.006 

5 

HOLP-SCAD 

0.010 

0.006 

5 

HOLP-EBICS 

0.010 

0.006 

5 

tilting 

0.017 

0.021 

6 

NULL 

0.021 

0.025 

0 


From Table|^ it can be seen that models selected by HOLP-SCAD, SIS-SCAD and RRCS- 
SCAD achieve the smallest cross-validation error. It might also be interesting to compare 
the selected genes by using the full data set, of which a detailed discussion is provided in Part 


E and Table S.3 in the Supplementary Materials. In particular, gene BE107075 is chosen by 


all methods other than tilting. As reported in Breheny and Huang (2013), this gene is also 
selected via group Lasso and group SCAD. 


5 Conclusion 

In this article, we propose a simple, efficient, easy-to-implement, and flexible method HOLP 
for screening variables in high dimensional feature space. Compared to other one-stage 
screening methods such as SIS, HOLP does not require the strong marginal correlation 
assumption. Compared to iterative screening methods such as forward regression and tilting, 
HOLP can be more efficiently computed. Thus, it seems that HOLP holds the two keys at the 
same time for successful screening: flexible conditions and attractive computation efficiency. 
Extensive simulation studies show that the performance of HOLP is very competitive, often 
among the best approaches for screening variables under diverse circumstances with small 
demand on computational resources. Finally, HOLP is naturally connected to the familiar 
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least-squares estimate for low dimensional data analysis and can be understood as the ridge 
regression estimate when the ridge parameter goes to zero. 

When n ~ p, concerns are raised for the HOLP as XX"’" is close to degeneracy. While 
the screening matrix = UU'^ remains diagonally dominant, the noise term 

X'^{XX'^)~^e = UD~^V"^e explodes in magnitude and may dominate the signal, affecting 
the performance of HOLP. We illustrate this phenomenon via Example (ii) in Section 4.1 
with p hxed at 1000 and = 90% for various sample sizes. The probability of including 
the true model by retaining a sub-model with size min{n, 100} is plotted in Fig[^ (left). It 


HOLP Ridge-HOLP (r= 10) Divide-HOLP 



Sample size n Sample size n Sample size n 

Figure 6: Performance of HOLP, Ridge-HOLP and Divide-HOLP for p = 1000. 


can be seen that the screening accuracy of HOLP deteriorates whenever n becomes close to 
p. We propose two methods to overcome this issue. 


Ridge-HOLP: As presented in Theorem 3, one approach is to use Ridge-HOLP by 
introducing the ridge parameter r to control the explosion of the noise term. In fact, 
one can show that araax{X'^{XX'^ + rln)~^) < r~^amaxiX), where amaxiX) ^ 0{y/p + 


^Jn) PS 0{y/n) with large probability. See Vershynin (2010). We verify the performance 
of Ridge-HOLP via the same example and plot the result with r = 10 in Fig [^(middle). 


• Divide-HOLP: A second approach is to employ the “divide-conquer-combine” strat¬ 
egy, where we randomly partition the data into m subsets, apply HOLP on each to 
obtain m reduced models (with a size of min{n/m, 100/m}), and combine the results. 
This approach ensures Assumption Al is satished on each subset and can be shown to 
achieve the same convergence rate as if the data set were not partitioned. In addition, 
it reduces the computational complexity from 0{n‘^p) to 0{'n?p/m). The result on the 
same example is shown in Fig(right) with m = 2. The performance of Divide-HOLP 
is on par with Ridge-HOLP when n is close to p. 


There are several directions to further the study on HOLP. First, it is of great interest 
to extend HOLP to deal with a larger class of models such as generalized linear models. 
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To address this problem, we may make use of a ridge regression version of HOLP and 
study extensions of the results presented in this paper. Second, we may want to study 
the screening problem for generalized additive models where nonlinearity is present. Third, 
HOLP may be used in compressed sensing (Donoho, [2006 ) as in Xue and Zou (2011) for 
exactly recovering the important variables if the sensing matrix satishes some properties. 
Fourth, we are currently applying the proposed framework for screening variables in Gaussian 
graphical models. The results will be reported elsewhere. 
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Supplementary Materials to ” High-dimensional 
Ordinary Least-squares Projection for Screening 

Variables” 


A: Additional results 

The Moore-Penrose inverse 

Definition 6.1. For A G a Moore-Penrose pseudo-inverse of A is defined as a matrix 

G such that 

AA^A = A, a+am = a+, (aa+)* = aa+, (a+a)* = a+a, 

where A* is the conjugate of A. 

Using this definition, we can verify X~^ = for p> n and X~^ = {X'^X)~^X'^ 

ioT p < n are both the Moore-Penrose inverse of X. 

The ridge regression estimator when r —)► 0 

Applying the Sherman-Morrison-Woodbury formula 

(a + UDV)-^ = a-^ - A-^U{D-^ + VA-^U)-^VA-\ 

we have 

r(r/p + X'^X)-^ = Ip- X^{In + f = Ip - X^{rln + XX^y^X. 

Multiplying X^Y on both sides, we get 

r{rlp + X^Xy^X^Y = X^Y - X^(rJ„ + XX^-^XX^Y. 

The right hand side can be further simplified as 

x^Y - x^yy + xxy-^xx^Y 

= x^Y - x^yy + xxy-^yy + xx^ - ry)Y 
= x^Y - x^Y + ryy + xxy-^Y = rx^yy + xxy-^Y. 

Therefore, we have 

yip + x'^xy^x^Y = x^yy + xxy-^Y. 


B: A brief review of the Stiefel manifold 


Let P E 0{p) he a. p X p orthogonal matrix from the orthogonal group 0{p). Let H denote 


the first n columns of P. Then H is in the Stiefel manifold (Chikuse, 2003). In general, the 


Stiefel manifold ia,p is the space whose points are n-frames in VA represented as the set of 
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p X n matrices X such that X'^X = In- Mathematically, we can write 


Vn,p = {Xe = In}. 


There is a natural measure {dX) called Haar measure on the Stiefel manifold, invariant under 
both right orthogonal and left orthogonal transformations. We standardize it to obtain a 
probability measure as [dX] = {dX)/V{n,p), where V{n,p) = 2"'7r"'^’/^/r„(l/2p). Let Rp^n 
be the space formed by all p x n nonsingular matrices. There are several useful results for 
the distributions on Rp^n and Vn,p, which will be utilized in the following sections. 


Lemma 1. ^Fa^^and^Tv, 2008) An nxp matrix Z can he decomposed as Z = VDU"’" via the 
singular value decomposition, where V E 0{n),U E Vn^p and D is an n x n diagonal matrix. 
Let zj denote the ith row of Z, i = 1,2, ■■■ ,n. If we assume that ZiS are independent and 
their distribution is invariant under right orthogonal transformation, then the distribution of 
Z is also invariant under 0{p), i.e. 


ZT Z, for T eO{p). 


As a result, we have 

{In,0p-n)xU, 

where U is uniformly distributed on 0{p). That is, U is uniformly distributed on Vn,p. 


Consider a different matrix decomposition. For a p x n matrix Z, dehne and as 

H, = Z{Z^Z)-^/^, T, = Z^Z. 


Then E Vn,p and Z = . This is called matrix polar decomposition, where Hz is 

the orientation of the matrix Z. We cite the following result for the polar decomposition. 


Lemma 2. 2003, Page fl-ff) Supposed that a p x n random matrix Z has the 

density function of the form 


fz{Z) = 

which is invariant under the right-orthogonal transformation of Z, where S is a pxp positive 
definite matrix. Then its orientation Hz has the matrix angular central Gaussian distribution 
(MACG) with a probability density function 

MACGiT) = 

In particular, if Z is a p x n matrix whose distribution is invariant under both the left- and 
right-orthogonal transformations, then Hy, with Y = BZ for BB"’" = S, has the MACGiTi) 
distribution. 


When n = 1, the MACG distribution becomes the angular central Gaussian distribution, 
a description of the multivariate Gaussian distribution on the unite sphere (Watson, 1983). 


Lemma 3. (Ghikuse, 2003, Page 10, Decomposition of the Stiefel manifold) Let H be a pxn 


random matrix on Vn^p, and write 


H = {H, H2), 





















with Hi being a p x q matrix where 0 < q < n. Then we can write 


H2 = G{Hi)Ui, 

where G{Hi) is any matrix chosen so that {Hi G{Hi)) G 0{j)); as H 2 runs over Vn-q,p, Ui 
runs over Vn-q^p-q and the relationship is one to one. The differential form [dH] for the 
normalized invariant measure on Vn^p is decomposed as the product 

[dH] = [dHi][dUi] 

of those [dHi] and [dUi] on Vg^p and Vn-q^p-q, respectively. 


C: Proofs of the main theory 


The framework of the proof follows Fan and Lv (2008), but with many modihcations in 
details. Recall the proposed HOLP screening estimator 


f = X^{XX^)-^Y = X^{XX^)-^Xf + X^{XX^)-h :=^ + p, 


where f can be seen as the signal part and p the noise part. 

Consider the singular value decomposition oi Z as Z = VDU^, where V G 0{n), U G 14,p 
and Zi) is an n by n diagonal matrix. This gives X = ZYfG = VDU'^YfG, Hence, the 
projection matrix can be written as 

X'^{XX^)-^X = Y}l‘^UDV^{yDU^TUDV^Y^VDU^Y}l‘^ 

where H = satisfying H^H = I. In fact, H is the orientation of the 

matrix Because Z is sphere symmetrically distributed and thus invariant under right 

orthogonal transformation, by Lemma 1, Z/ is then uniformly distributed on the Stiefel man¬ 
ifold Vn^p, meaning that it is invariant under both left- and right-orthogonal transformation. 
Therefore, by Lemma 2, the matrix H has the MACG(S) distribution with regard to the 
Haar measure on Vn,p as 

and we can write f in terms of H as 

e = hh^y 


The whole proof depends on the properties of f and p, where f requires more elaborate 
analysis. Throughout the whole proof section, || ■ || denotes the I 2 norm of a vector. The 
following preliminary results are the foundation of the whole theory. 


Property of HH^ff 

In this part, we aim to evaluate the magnitude of HH^fd. Let = (0, • ■ ■ , 1, 0, ■ ■ ■ , 0)^ 
denote the fth natural base in the p dimension space and ei denote the n-dimensional column 
vector (1, 0, • • ■ , 0)^. We have the following two lemmas. 
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Lemma 4. If assumption A1 and A3 hold, forC > 0 and for any fixed vector v with ||i;|| = 1, 
there exist constants c[, C 2 with 0 < c'^ < 1 < C 2 such that 


P( v^HH^v < c[ 


n 


l — T 


or HH^v > Cn 


n 


l+r 


< 4e 


-Cn 


P p 

In particular for v = (3, whose norm is not 1 though, a similar inequality holds for one side 

1+T " 


with a new C 2 as 


P( > 4 


n 


p 


< 2e 


-Cn 


Lemma 5. If assumption Al and A3 hold, then for any C > 0, there exists some c, c > 0 
such that for any i E S, 


p{ < c 


n 


1 — T—K 


p 


< 0< exp 


-Cn 


l — bT—2K—U 


2 logn 


and for any i ^ S, 


P( > 


c n 


l — T — K. 


< 0< exp 


-Cn 




2 logn 


\/logn p 

where r, k, u are the parameters defined in A3. 

Lemma 6. Assume A1-A3 hold, we have for any i E {1, 2, • • ■ ,n} 


1^*1 > <exp|l-g(^ ^^^^^^ -) |^+3exp(-C'in) 


\/\ogn p 

where Ci,ci,C 4 are defined in the assumption, and 4 is defined in Lemma^ 


Proof of the three lemmas 

To prove Lemma we need the following two propositions, first of which is Lemma 3 and 
the second of which is similar to Lemma 4 in Fan and Lv (2008). For completeness, we 


provide the proof for the second proposition right after the statement. 


Proposition 1 (Lemma 3 in Fan and Lv (2008)). Let 4, * = 1, 2, • • • ,n be i.i.d x\-distributed 
random variables. Then, 


(i) for any e > 0, we have 


P(n ^5^4 > 1 + e) < e 


2=1 




where A,, = [e — log(l + e)]/2 > 0. 
(a) for any e > 0, we have 


P(n < 1 - e) < e 


2=1 


—BeU 
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where = [—e — log(l — e)]/2 > 0. 

In other words, for any C > 0, there exists some 0 < Cg < 1 < C 4 such that 


P[n > cl) < 


i=l 


-.—Cn 


and 


P{'n- < cl) < 


2=1 


—Cn 


Proposition 2. Let U he uniformly distributed on the Stiefel manifold Then for any 

C > 0, there exist c\, with 0 < c'^ < 1 < c' 2 , such that 

p(e^UU'^ei<c[- or e{UU'^d > < Ae-^^. 

\ P Pj 

Proof. First, U'^ can be written as (/„ 0„,p_„)f/, where U is nniformly distributed on 0{p). 
Apparently, Uei is uniformly distributed on the unite sphere S^~^. Thus, letting {xi,i = 
1 , 2 , ■ ■ ■ ,p} be i.i.d random variables following A^( 0 , 1 ), we have 


T~T iD 
u Cl — 


Xi 


X2 


Xr, 


2^.7 = 1 \ 2^7 = 1 


^2 


2^j = l Xj 


Hence U'^ei is the hrst n coordinates of JJei. It follows 


xl + xl^ - v xl 

From Proposition!^ we know that for any C > 0, there exist some Ci and 62 such that 


Pi > Cl ] < 


n 




and 


P 


(TL 


^2 
1 Xi 


1 - > Cl < e , . . 

\ P J \ P 

Letting c'l = C 2 /C 1 , c^ = Ci/c 2 and by Bonferroni’s inequality, we have 

P (e^UU^Cl < c'g— or e'^UU'^ei > c^—^ < 4e“^"'. 
\ P Pj 

The proof is completed. 


-Cp 


P 


n 




<C 2 < e 


-Cp 


□ 


Proof of Lemma Recall the dehnition of H and 
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There always exists some orthogonal matrix Q that rotates the vector to the direction 
of Cl, i.e, 

= llE^nllQci. 

Then we have 

v'^HH^v = \\Ehfe^Q^U{U^EU)-^U^Qei = \\E-2vfe^U{U^EU)-^Uei, 

where U = Q^U is uniformly distributed on Vn^p, since U is uniformly distributed on I 4 p (see 
discussion in the beginning) and Haar measure is invariant under orthogonal transformation. 
Now the magnitude of v^HH^v can be evaluated in two parts. For the norm of the vector 
San, we have 

Amm(S) < = IlS^niP < Ama^(S), (5) 

and for the rest part, 

and 

Consequently, we have 

T TT ttT ^ Amax(F) rp j, rp j, Am,m(FI) rp p 

V HH v<- - 7 ^e({/{/ ei, v HH v>- -{/{/ Ci. (6) 

'^maxy^) 

Therefore, following Proposition and A3, for any C > 0 we have 
pfv^HH^v < c[ci —— or v^HH^v > C 2 C'^^- -^ 

V P P J 

Denoting c[c 4 by c[ and € 204 ^ by C 2 , we obtain the equation in the lemma. 

Next for v = /3, it follows from Assumption A3 that 

var(Y) =/3^E/^ + a^ = 0(1). (7) 

Equation (|^ then can be updated as 

< c' 

for some constant c', and (§ now becomes 

Since the trace of the covariance matrix S is p, which entails that Amaa;(S) ^ 1 and X^ini^) < 
1 . Now with assumption A3, we have 

\ \ Xmin{^) ^ _i -r /o\ 

^ T /y,\ ^ Tl . ( 8 ) 
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Combining the above two equations, we have that for some new C 2 > 0, it holds 


P > c' 


n 


l + T 


P 


< 2e 


-Cn 


□ 

The proof of Lemma [^relies on the results from Stiefel manifold. We hrst prove following 
propositions, which can assist the proof of Lemma 

Proposition 3. Assume a p xn matrix H G W,,p follows the Matrix Angular Central Gaus¬ 
sian distribution with covariance matrix S. From Lemma^we can decompose H = {Ti,H 2 ) 
with Ti = G{H 2 )Hi, where H 2 is a p x {n — q) matrix, Hi is a {p — n + q) x q matrix and 
G{H 2 ) is a matrix such that {G{H 2 ), H 2 ) G 0{p). We have following result 

Hi\H2r~^ MAGG{G{H2fm{H2)) (9) 

with regard to the invariant measure [Hi] on 

Proof. Recall that H follows a MAGG(J2) on which possesses a density as 

p{H) oc \H^E-^H\-P^^[dH]. 

Using the identity for matrix determinant 

= \A\\D - GA-^B\ = \D\\A - BD-^G\, 


A B 
G D 


we have 

P{Hi,H2) oc \HIT.-^H2\~^'‘^{T^T^-^Ti - T^T.-^H2{H^T.-^H2)~^H'^T^-^Ti)-p/‘^ 

= \HlT.-^H2\-^/‘^{H^G{H2f{T^-^ - T.-^H2{H1^T.-^H2)-^H^T^-^)G{H2 )Hi)-p/‘^ 
= \HlT.-^H2\-^/‘^{H^G{H2fT^~^^‘^{I - T2)T.-^/‘^G{H2 )Hi)-p/‘^ , 

where T 2 = T.-^/^H 2 {HIT.-^H 2 )-^H^T.-^/^ is an orthogonal projection onto the linear space 
spanned by the columns of It is easy to verify the following result by using the 

dehnition of G{H 2 ), 

[E^/^G{H2){G{H2fm{H2))-^/\ E-^/^H2 {HJe-^H2)-^^^] G 0{p), 
and therefore we have 

I-T2 = E^^^G{H2){G{H2fEG{H2))-^G{H2fE^^‘^, 

which simplihes the density function as 

P{Hi,H2) oc \H^E-^H2\-P/\Hl{G{H2fj:G{H2))-^Hi)-P/\ 

Now it becomes clear that Hi\H 2 follows the Matrix Angular Central Gaussian distribution 
ACG(S'), where 


S' = G{H2f^G{H2). 
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This completes the proof. □ 

Proposition 4. Assume H e 14,p- Write H = (Ti,hf 2 ) where Ti = 
is the first eolumn of H, then we have 




= elHH^ei. 


Proof. Notice that for any orthogonal matrix Q G 0{n), we have 

ejHH^e2 = e^HQQ^H^e2 = elH'H'^e2. 

Write H' = HQ = {TfiH'Q, where T[ = hf' = If we choose 

Q snch that the first row of (this is possible as we can choose the first colnmn 

of Q being the first row of H npon normalizing), i.e., 


efH' = [T[^^\ O,-- - ,0] 


then immediately we have efHH^e 2 


efH'H'^e 2 = This indicates that 


e{HH^e2 = 


id) 


r-pA)rp{2) 


eiH2 = 0 . 


Next, we transform the condition e\H 2 = Q to the constraint on the distribntion of T^\ 
Letting t\ = efHH^ei, then ejH 2 = 0 is eqnivalent to = ejHH^ei = tf, which implies 
that 


(d) 




ei HH^ 62 = Tfi>Tl 


= e^HH^e,. 


□ 


Proposition 5. Assume the conditional number of S is condiTi) and Hu = 1 for i = 
1 , 2, • • • ,p, then we have 




1 


and XmaxiH) < condfH). 


condiH) 

Proof. Notice that p = tr{H) = Therefore, we have 


P/^max P 


p 


cond{H) 

which completes the proof. 

We now tnrn to the proof of Lemma 


and p/Xmin{H) < p ■ cond(S), 


□ 


Proof of Lemma^ Notice that to qnantify is essential to qnantify the entries 

of HH^. The diagonal terms are already stndied in Lemma as taking v = ei we have 


n 


1 — T 


P[ ejHH^ei < c[ -or ejHH^ei > 4 


n 


l+r 


p 


p 


< 4e 


—Cn 


( 10 ) 
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The remaining task is to quantify off diagonal terms. Without loss of generality, we prove 
the bound only for ef if if ^ 62 , then the other off-diagonal terms should follow exactly the 
same argument. According to Proposition]^ with q being 1, we can decompose if = (Ti,if 2 ) 
with Ti = G{H 2 )Hi, where if 2 is apx (u — 1 ) matrix, Hi is a (p —n + 1 ) x 1 vector and G(if 2 ) 
is a matrix such that {G{H 2 ), H 2 ) € (P(p).The invariant measure on the Stiefel manifold can 
be decomposed as 


IH] = 

where [ifi] and [if 2 ] are Haar measures on Vi^n-p+i^ ifi|if 2 follows the Angular Central 

Gaussian distribution ACG(S'), where 

S' = G{H2fT.G{H2). 


Let ifi = (fi, ^ 2 , ■ ■ ■ , hpY and let = (xi, X 2 , - ■ ■ , Xp-n+i) ~ A^(0, S'), then we have 


hi 


(£) 



2 

p—n-\-l 


Notice that Ti = G{H 2 )Hi, a linear transformation on Hi. Dehning y = G{H 2 )x, we have 


Tii) (J 



( 11 ) 


where y ~ iV(0, G(if)S'G'(if)^) is a degenerate Gaussian distribution. This degenerate 
distribution contains an interesting form. Letting x ~ A^(0, S), we know y can be expressed 
as y = G{H)G{H)^z. Write G(ff 2 )^ as [gi,g 2 \ where is a (p — n -f 1) x 1 vector and p 2 is 
a (p — n + 1) X (p — 1) matrix, then we have 

G{H2)G{H2f = . 

\g2 9i gig2j 


We can also write ffj = [0„_i_i, f. 2 ] where ^2 is a (n — 1) x (p — 1) matrix, and using the 
orthogonality, i.e., [ff 2 G{H 2 )][H 2 G{H 2 )Y = Ip, we have 

g^gi = 1 , glg 2 = Oi,p_i and glg 2 = Ip-i - h 2 h 1 . 

Because ^2 is a set of orthogonal basis in the p — 1 dimensional space, pjp 2 is therefore an 
orthogonal projection onto the space {h 2 }^ and g^gi = AA^ where A = g 2 {g 2 g 2 )~^^‘^ is a 
(p — 1) X (p — n) orientation matrix on {h 2 }^. Together, we have 


»=(o AA^)~ 


This relationship allows us to marginalize pi out with y following a degenerate Gaussian 
distribution. 
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Now according to Proposition and assuming tf = ejHH^ei, we have 




_ 4.2 

r 1 — 


Because the magnitude of ti has been obtained in (10), we can now condition on the 
value of to obtain the bound on From = tf, we have 

+ + + ( 12 ) 

Notice this constraint is imposed on the norm of y = (y 2 , 2 / 3 , ■ ,2/p) and is thus independent 

of ( 2 / 2 / 11 ^ 11 , • • ■ ,2/p/ll^ll)- Equation (12) also implies that 


(l-^f)(2/f + 2/2^ + --- + 2/p^)=2/2^ + 2/| + --- + 2/p^ 


(13) 


Therefore, combining (11) with (12), (13) and integrating 2/1 out, we have 


rpM I ^(1) _ 4 4 — 00... „ 

-‘I I -‘I — — /-, f ,P, 

^Jy| + ■■■ + yl 


where ( 2 / 2 , 2 / 3 ,' ‘ ‘ , 2/p) ~ -^(0, AA^T, 22 A.A^) with S 22 being the covariance matrix of 2 : 2 , • • • , Zp. 

To bound the numerator, we use the classical tail bound on the normal distribution as 
for any f > 0, (aj = \/var{yi) < ^X^ax{,AA^Y, 22 AA^) < Amarc(E)^/^), 


P{\yi\ > tai) < 2 e 

For any a > 0 choosing t = we have. 


(14) 


P{\yi\ > —— 4 == —n 2 ) < 2exp 


■y/logn ^ ^ 1^ 21ogn 

For the denominator, letting 5 ~ iV(0, Ip-i), we have 


p—n 


y = AA^Y^-'tz and fy = AA^Y^'^z = 


2=1 


where T’j^(l) are iid chi-square random variables and Aj are non-zero eigenvalues of matrix 
yI{^AA^yI{\ Here A j’s are naturally upper bounded by Amaa:(E). To give a lower bound, 
notice that Y^^AA^Y^^ and ^4^22^^ possess the same set of non-zero eigenvalues, thus 

min A^ P_ X^in{^AY 22 A ) ^ A min (E). 


Therefore, 




The quantity 




p—n 


p — n p — n 

can be bounded 


p — n 


by Proposition Combining with Proposition 
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we have for any C > 0, there exists some Cg > 0 such that 


P y^y/{p-n) < 


< p-Civ-n) ^ 


^min (S). 

Therefore, noticing that cond(JP) = XmaxiJPj/^miniJPj < 04 / 1 "^, can be bounded as 


P{ |Tf^| > 


- riy/logn 


=ti) < + 2 


exp 


-Cn 


l-2a 


Using the results from (10), we have 

a+r' 


F > c' 


n 


P 


<2e and P[ti<c[ 


n 


l — T 


P 


< 2e 


2 logn 


-Cn 


Consequently, dehning M = ^ we have 

VCsho-i) 


F \e{HH^e2\ > 


M 

Vlogn p 


= P[ > 


M 


< F( > c' 2 — |Tf^ = tl ) + F( |Tf^1 > 


P 


\/\ogn p 
(2) I ^ y/Cc^n 


= h 


V^a/Io^ 


= tl 


< + 4e"^" + 2 


exp 


-Cn 


l-2a 


2 logn 


= 0{ exp 


-Cn 


l-2a 


21 ogn 


This result provides an upper bound on off diagonal terms. Using this result, we have for 
any i ^ S 




j&s j£S 

y/dCiM 77,1+3F2+U2- 


<JY, \<!THHni\^-mh< 

j&s 


Vlogn 


P 


(15) 


with probability at least 1 — 0 < n'^ exp 


2 log n 


The last inequality is due to Assumption 

3 that var{Y) = 0(1), which implies 

cj^||/3|pn“^ < ||/3|pAmi„(S) < = var{Y) -a^ <d (16) 


for some constant d. Taking a = (5/2)r + k + i//2 in (15), we have 

-,1 — T — K.\ /_ ^l^l — bT—2K—U 


P[\eiHH^P\> 


c n 


= 0 { exp 


(17) 


A/logn p ) ^ 2 logn 

where c = \/ dc^M and C is any constant that is less than C (O is also an arbitrary constant). 
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Next, for i G S', combining (15) with (10) and using the same value of a yields 


jes 


> C 1 C 2 


n 


l — r—K 


c n 


l — T — K 


p ^/\ogn p 


> 


C\C2 
2 p 


with probability at least 1 — 2e — oln'^exp ^ 
have 


1 —5t —2« —n" 


2 log n 


. Letting c = we 


P\V-EE-?\<c''—yO{..^ 21og„ 


1 — r—K 


_ ^l — br—2K,— u 


which completes the proof. 


( 18 ) 


□ 


Proof of Lemma 1^ Recall the random variable pi = ejp = eJX'^{XX'^) ^e. If we dehne 

a = efX^{XX^)-^/\\efX^{XX^)-%, 
then a is independent of e and 

Pi = ||efX^(XX '^)“^||2 ■ aw, 

where tc is a standardized random variable such that w = aFe/a. Now for the norm we have 

||efX^(XX^)-^||2 = efX^(XX^)-2Xei = X^ {XX^)-^d(^xX^)-^{XX^)-^dxei 
< \rna.{{XX^)-^)\\{XX^)-d‘^Xei\\^ < \ma.{{XX^)-^)e^HH^e, 

= Xma.{{Z^Z^)-^)ejHH^ei. (19) 

The first term follows that 

= {X^UZ^Z^))-^ < Xrmn{ZZ^)-^X^in{^)-^ = p-^Xmin{p-^ZZ^)-^X^in{^) 

( 20 ) 


< —Xrmn{p-^ZZ^)-\ 
P 


where the last step is due to equation ([^. According to A2, for some (Si > 0 and Ci > 1, we 
have 

p(^Xmax{p~^ZZ'^) > Cl or Xmin(p~^ZZ'^) < < exp ( - CiTl) , 


which together with (20) ensures that 


p(a^,4(zsz^)-i) > 


ciC4n 

P 


< P 


{^-^{X^in{p-^ZZ'^))-^ > 


-I rv r7T\\-^ ^ Cl 04 / 1 ’’ 


P 


= P( Xrr^Up-^ZZ^) < crM < 


( 21 ) 
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Combining Lemma and (21) entails that for the same Ci > 0, 

P^\\ejX'^{XX'^)~^\\l > CiC 2 ^ 4 ^ ^ ^ < 3exp ( — Cin). 
For w, according to the g-exponential tail assumption, we have 

n 

^(lE aiei/a\ > t) < exp(l — q{t)). 

i=l 


T-v , . , y.l/2—2T — K - 

By choosing t = — \/\s^ -? have 


P\ Itnl > 


-y/logu 


^ ,1 /ycrnV2-2.- 


\n 


Combining it with (22), taking the union bound, we have 

\/CiCic'2Ci ^ \ ^ 


^ \Vi\ > 


a 


-y/logn 
The proof is completed. 


P 


< exp < 1 — g 


V 


n 


(22) 


+ 3 exp ( — Ciu). 


(23) 

□ 


Proof of the theorems 

By now we have all the technical results needed to prove the main theorems. The proof of 


Theorem 1 follows the basic scheme of Fan and Lv (2008) but with a modihcation of their 


step 2 by using our Lemma The proof of Theorem 2 is a direct application of Lemma 
and step 4 in the proof of Theorem 1. Theorem 3 mainly use the properties of Taylor 
expansion on matrix elements. 


Proof of Theorem 1. Applying Lemma and Lemma to all i & S, we have 

-,1—T — K,\ ( / — 5r —2 k —' 


n 


P\ min |C| < c- 

' i&S p 


= 0< s ■ exp 


2 logn 


(24) 


and 


D/ II. cTyC'iCic'2C4U^ ^ [ f ''M , o ^ ^ ^ 

P ( max \pi\ > — -y— ) ~ ‘ ~ 5' 1 I j + 3s • exp ( — CiUj. 

(25) 


ies 


\/\ogn p 


Because s = c^rp with z/ < 1, taking C = 2Ci, (24) can be updated as 

1—r— k;\ ( / — 


I 11^ 

P min A < c- 

' i£S p 


= 0< exp 


2 logn 


(26) 
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Therefore, if we choose 7 ^ such that 


c + (j^ C 1 C 1 C 2 C 4 ‘ " cn 


y/\ogn p 

or in an asymptotic form satisfying 


— T—K. ^ ^\ — T—K. 

<ln< 


2 p ’ 


(27) 


n 


P7n „ , P7nVlog^ , 

U and —;-^ -(X), 


l — T — K. 


n 


l — T — K 


(28) 


then we have 


F min |/3i| < 7 ^ = F min + pi\ < 7 , 

i^S } \ iGS 


n 


1 —r—K. 


< FI min IF I < c- 
' ies p 


a\JCxCxd^Ci^ '' 


+ F max| 7 i|> - 

* 6 S V log n 


2 logn 

This completes the proof of Theorem 1. 


p 

—2t—k 


= 0 -!exp| --j|+s.exp|l- 9 (^^j= 


:n 


□ 


Proof of Theorem 2. According to Lemma for any i ^ S and any C > 0, there exists 
a c > 0 such that 


F > 


c n 


l—T—hi 


\/logn p 
Now with Bonferroni’s inequality, we have 


< 0< exp 


-Cn 


1—5t—2k:— h' 


2 logn 





max t 

i\ > ,, - 

= F max 

V 

Vlog n p J 

V 


c n 


1 — r—K: 


< 0<p ■ exp 


\/logn p 

_^^1 —5r—2 k:— z/ '' 


2 logn 


Also, applying Bonferroni’s inequality to (23) in the proof of Lemma gives 


(29) 


Dl II. PyC'iCic'jCin' “ 7 ^ f, f “1 I , , t r' \ 

F( max \Pi\ > — / I - 7 — ] < P' exp ~ *?( -- ) f + "IP' exp ( — Cinj. 


\/logn p 


Now recall that 


logp = o 


min 


nl- 2 K- 5 r /y^^^l/2-2 r-K\ \ 

2 logn ’^V \/logn / i / 


(30) 
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we have for the same Ci specihed in A2 (with the corresponding c), 


P ( max |,^i| > 


c n 


1 — T — K 


i^S 


\/logn p 




< 0{ exp -C 




n 


1—5r—2 k— h* 


nf II. (T^C'iCiC'2C4n^ ^ 

F max|pi| > ^ - -< 0< 

V * Vlogn p J y 


^ 2 log n 

. ^ 1 /yCinV2-2r- 


+ exp 


(31) 



(32) 


Now if 7 n is chosen as the same as in Theorem 1, we have 


i^S 


P[ max |/3i| > 7n < 0< exp ( - Ci 


n 


1—5t—2/^—1^ 


21ogn 


+ exp I 1 — -g 


1 /yCinV2-2- 


2^V a/Ic^ 


n 


Therefore, combining the above result with Theorem 1 and noticing that s < p, we have 


f( min \^i\ > 7n > max |/3i| ) = l-Ol exp ( -Ci 


ies 


i^S 


n 


1—5t—2k—1/ 


2 logn 


^ +exp ^ 


^ 1 /yCinl/2-2r- 

2^\ a/Io^ 


:n 


Obviously, if we choose a submodel with size d that d > s we will have 


P{Ms C Md) = 1 - o| exp ( - Cl 


n 


1—5t—2k—1/ 


21ogn 


+ exp 1 — -g 


1 /yOini/2-2r- 


2*^V \/i^ 


n 


(33) 


which completes the proof of Theorem 2. □ 

Proof of Corollary 1. Replacing q{t) by /K‘^ for some Cq > 0, the condition (|^ 
becomes 


logp = o 


n 


1-2k-5t 


logn 


and the result becomes 


F( min \( 3 i\ > 7 n > max |/3i| ) = 1 - 0< exp ( - Ci 


ieS 


i^S 


n 


1—2k—5t—u 


= 1 - 0<^ exp -C 


2 logn 

1—2k— 5r—z/ 


+ exp 


n 


2 logn 


The proof of Corollary 1 is completed. 


(34) 


CoCi 1 

K'^ 21ogn / / 

(35) 

□ 


Proof of Theorem 3. It is intuitive to see that when the tuning parameter r is sufficiently 
small, the results in Theorem 2 should continue to hold. The issue here is to hnd a better 
rate on r to allow a more flexible choice for r. Following the ridge formula in Part A of the 
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Supplementary Materials, the HOLP solution can be expressed as 


/3(r) = X'^{XX^ + r4)-'X/3 + X^{XX^ + rl^yh := ^(r) + r/(r). (36) 

We look at .^(r) first. Using the notations in the very beginning of this section, we write 

X^{XX^ + rIn)~^X = T}/'^U DV^ {V DU^HU DV^ + rIn)~^V DU^T}/"^ 

= + rD-^)-^U^T}/^ = T}/^UA-^{In + rA-^ D-^A-^Y^A-^U^T}/"^ , 

where A = (U'^EU)^^^, the square root of a positive dehnite symmetric matrix, i.e., U'^EU = 
A^A = AA^ = A^. In order to expand the inverse matrix by Taylor expansion, we need to 
evaluate the largest eigenvalue of the matrix A~"'"D~‘^A~^, where 

\maAA-^D-^A-^) < X^aAD-^)X^^,{{AAY-^) = X^mniDY^X^UU^EUY^- 

According to A1 and A3, we have 

P{p-Armn{D^) < cY) < 

for some Ci > 1 and Ui > 0 and 


Amm(U^SU) > X^in{E)Xmin{U'^U) > C4 

Therefore, with probability greater than 1 — we have 

X„,aAA-^D-^A-^) < (37) 

P 

meaning that when r < pCi^Y^n~'^, the norm of the matrix r A~^ D~^ A~^ is smaller than 1, 
and that the inverse of the matrix can be expanded by the following Taylor series as 

OO 

Y}PUA-\ln + rA-^D-‘^A-^Y^A-^U^E^/^ = E^/^UA-\ln + ^A{A-^ D-‘^ A-^Y)A-^U^E^/^ 

k=l 

OO 

= HH^ + ^YE^/^UA-\A-^D-‘^A-^YA-^U^E^/^ = HH^ + M. 

fc=i 

The largest eigenvalue for each component of the inhnite sum in the above formula can be 
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bounded as 


<\rnax{T}I^UA-^A-^U^T}l^)\rnax{{A-^D-‘^A-y) 
= Xmax{HH^)X,nax{A-^D-^A-^)^ 

< Xmax{A-^D-^A-^)\ 


and so is their infinite sum as 


XmaxiM) < Y,r’^^max{A-^D-‘^A-^f < — 


rX^ax{.A ^A 


k=l 


1 - r\„,,(A-^D-^A-i) 


(38) 


The last step in the above formula requires rXmax(A ^A to be less than 1, and 


according to (37), it is true with probability greater than 1 — e Now with equation (37) 


and (38), we have 


P Xmax(M) > 


CiC4rn 
p — CiCiVn’ 


< P 


rXmaxjA ‘^A CiC4rn^ 

1 - rXmax{A-'^D-^A-^) ^ p - cic^rn 


= P{Xmax{A-^D-^A-^)> 


CiC^n 

p 


< e 


-Cin 


(39) 


With the above equation, the following steps are straightforward. By choosing an appropriate 
rate of r, the entries of M will be much smaller than the entries of HH^, and the results 


established in Theorem 2 will remain valid. For any i G {1, 2, • • • ,p}, according to (16) we 
have 


max |e- M^f3 < ||/?|| Amaa;(M)^ < c'c^rAXmax{My 


Hence if r satisfies rnX'P'^^'^ ^ 0 to ensure 

Cl C 4 \/(?c^r 77,^/^'^+''“ ^ 
1 — CiC/prrA /p 


= 0 ( 1 ), 


we can obtain an upper bound on max* |efM/d| following (39) as 

p-T-K c\C4\/d' 


p{ max |efM/3| > 


n 


p 


1 — CiCiTn^ /p 


< p[ yjdCiTdXmax{M) > 


Cl €41/0^™ 2'' 

p — CiC^rn'^ 


< e 


-Cin 
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Thus, 


P 


max \efM(3\ > oil)^ -^ < e 

*e{i,-",p} p J 


Recall the fact that ^i{r) = 6 + eiM/3. Combining the above equation and the results 


obtained in (26) and (31), similar properties for ^j(r) can be established as 

1—T—K.\ ( / ^l—5r—2/^—1^ 


/ Tl 

^ max|^.(r)| > o(l)- 

V p 

( C T) 

< 2 — 


< 0{ exp -C 


n 


< 0< exp ( — Cl 


21ogn 

l—bT—2hi—u 


n 


21ogn 


(40) 


where c is some positive constant and o(l) represents -J= + / —. 

^ \ J ^ Vlogn l—cic^rn^/p 

Second, we look at //(r). By a similar argument, the previous results on r] in Theorem 2 
can be generalized. Because 


Vi{r) 


eiX^ {XX^ +r4)-'e. 


it follows 

var{pi{r)\X) = a\JX^{XX^ + rQ-^Xei 

= a^eJX^{XX^ + rIn)-^/^{XX'^ + rIn)~\XX'^ + rIn)~^^^Xei 
< a^X^a.{{XX^ + r4)-i) • eJX^iXX^ + rQ-^Xe, 

= a^{Xmin{XX^ + rln))-^ • eJX^iXX^ + rQ-^Xa. 

Using the same notation in the ^{r) part, we have 

eJX^iXX^ + rIn)-^Xei = e^HH^a + e^Ma < ejHH^e, + Xma^M), 

and for Xmini.XX'^ + r/„), it can be expressed as 

X^,n{XX^ + rJ„) = r + X^,n{XX^) > 

Therefore, the conditional variance of can be reformulated as 

var{p,{r)\X) < a\X^,n{XX'^)Y\eJHH^Ci + A^,,(M)) 

= a\X^,n{XXYY^ejHH^ei + a\X^i^{X X^Y^ 

= a2(A^,„(ZSZ^))“'efM^e, + a\Xm^n{ZT.ZYY^^ma.{M). (41) 
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The first term in the above formula appears as var{T]i\X) in (19) in the proof of Lemma 
while the second term a‘^{Xmin{ZT,Z'^))~^Xmax{M) is introduced by the ridge parameter r. 
If we are able to show that this new conditional variance has the same bound as specihed 
in equation (22) (with a different constant), then a similar result of Lemma can also be 
established for ? 7 i(r), i.e., 


0/1 ^ M ^ a^2C'iCic'2C4 ^ \ l , ^ \ 

hir)\ > - ^ - —j <exp|l-g(^- ^===-) |. + 5exp ( - Cm). 




(42) 


and therefore, 


P[ max |? 7 i(r)| > 


(Ta/2C'iCiC2C4 n^~ 




n 


P 


< O/ exp I 1 - 29(^j^ 


m 


Cl 

+ exp ( - yn 
(43) 


In fact, a similar equation as (22) can be verihed for this new variance directly from (21) 

, -1 


and (39). Since [Xmin{Z'ZZ^)) = Xmax{{Z'ZZ^) ^), by inequality (21) and (39) we have 


P i {Xmin{ZTjZ"'")^ Xmax{M) > 


^ ^ 2 CiC 4 n‘ Cidrn' \ ^ 


a 


p p — ciCivn' 

Rearranging the lower bound in the probability gives 


< 2e“ 


p[a\Xrn^n{Z^Z'^)) ^X^ax{M)> 


n 


1+2t 1 


p^ 1 — CiC^rn^ /p 

If r satisfies the condition stated in the theorem ensuring 


< 2e 


—C\n 




1 — ciCirn^ /p 


= o(l), 


then it holds that 


/ 1 77- 

P(a\X^in{Z^Z^)y^Xmax{.M) > o(l)- 


1+2t 


pZ 


< 2e 


—C\n 


which combined with (22) and (41) entails that 


n 


l+2r 


P\^ar{pi{r)\X) > 2ciC2*^4 " ^ 1 < 5e 


-Cin 


and thus proves equation (43) (by following the argument in Lemma |^. 
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Finally, combining (40) and (43) we have, 


F( min|A(r)| < - 

i&s 4 p 


< 0< exp -C 


n 


1— 5r— 2k— y 


n 


1 — T — K 


P[ max|A(r)| > o(l)- 

1^0 p 


< 0{ exp -C 


2 logn 

1—5r—2 ac— z/ 


+ exp (I - q 




—2t—k 


n 


^ 2 log n 


+ exp { 1 — q 


V 


n 

—2t—k 


V 


n 


where o(l) is used to denote + ac4^ 

^ ^ \/logn 1—cic 


L—ciC 4 rn 


^^5/ 2t+k 1 I inhnitesimal. 

fp ’ 


Therefore, if we choose 7„ such that 


■v/logn 
or in asymptotic form, 

InP 


c + a^/2C^c^d^ ^ ciC4\/c'c4rn^/^'^+'^ ^ ^ n^~ 


1 — CiCiTn'^ /p 


P 


c n 

< 7n < 7 


1 — K — T 


4 p 


n 


1 — K — T 


^ ^ JnPV^ogn j 7nP ^ 

—)■ 0 and —;- > oo and —5- > 00. 


n 


l — K—T 


3, 
TO 2 


Then we can conclude similarly as Theorem [T] and that 


P{ min|/3i(r)| >7^ > max |/3i(r)| 
ies i^S 


= 1 — 0\ exp ( — Cl 


n 


1—5t—2«:—Z2 


+ exp 1 - -g 


1 /VCinV2-2r- 


2 logn 

Obviously, if we choose a submodel with size d that d > s we will have 


2*^V 3/Ibg 


n 


n 


1—5t—2k—u 


P|V(,cAip=l-o|exp(-C,: 


+ exp 1 - -g 


1 /yOin72-2r- 


2*^V 


n 


(44) 


(45) 


(46) 


The proof is now completed. 


□ 


D: Additional simulation 

This section contains the plots for ridge-holp in Simulation study 2 in Section 4.2 as well as 
the detailed simulation results for (p, n) = (1000,100) in Simulation Study 1 in Section 4.1 
and Simulation Study 4 in Section 4.4. 
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ridge-HOLP with R^= 90% 


ridge-HOLP with R^= 50% 




Figure S.l: ridge-HOLP (r = 10): P(minjg5 \( 3 i\ > maXi^s |AI) versus sample size n. 


Table S.l: The probability to include the true model when {p, n) = (1000,100) for Simulation 
Study 1 in Section 4.1 



Example 


HOLE 

SIS 

RRCS 

ISIS 

FR 

Tilting 



(i) Independent predictors 


0.685 

0.690 

0.615 

0.270 

0.370 

0.340 




CO 

o 

II 

0.195 

0.135 

0.195 

0.050 

0.005 

0.000 



(ii) Compound symmetry 

p = 0.6 

0.020 

0.010 

0.040 

0.005 

0.000 

0.000 




II 

o 

0.000 

0.000 

0.000 

0.000 

0.000 

0.010 




CO 

o 

II 

0.810 

0.810 

0.790 

0.510 

0.555 

0.525 



(iii) Autoregressive 

p = 0.6 

0.970 

0.985 

0.970 

0.560 

0.390 

0.355 


= 50% 


p = 0.9 

0.990 

1.000 

1.000 

0.500 

0.185 

0.160 


II 

to 

0.295 

0.000 

0.000 

0.045 

0.135 

0.105 



(iv) Factor models 

/c = 10 

0.060 

0.000 

0.000 

0.000 

0.000 

0.025 




k = 20 

0.010 

0.000 

0.000 

0.000 

0.000 

0.000 




S'^ = 0.1 

0.935 

0.970 

0.950 

0.000 

0.000 

0.000 



(v) Group structure 

,52 = 0.05 

0.950 

0.970 

0.950 

0.000 

0.000 

0.000 




,52 = 0.01 

0.960 

0.980 

0.970 

0.000 

0.000 

0.000 



(vi) Extreme correlation 


0.305 

0.000 

0.000 

0.000 

0.000 

0.020 



(i) Independent predictors 


1.000 

0.995 

0.990 

0.990 

1.000 

1.000 




CO 

0 

II 

0.980 

0.815 

0.705 

0.955 

1.000 

0.990 



(ii) Compound symmetry 

p = 0.6 

0.830 

0.580 

0.435 

0.305 

0.575 

0.490 




p = 0.9 

0.100 

0.030 

0.055 

0.005 

0.000 

0.050 




CO 

0 

II 

0.990 

0.965 

0.945 

1.000 

1.000 

1.000 



(iii) Autoregressive 

p = 0.6 

1.000 

1.000 

1.000 

1.000 

1.000 

1.000 

7^2 

= 90% 


II 

0 

1.000 

1.000 

1.000 

0.970 

0.985 

1.000 


II 

0.940 

0.015 

0.000 

0.490 

0.950 

0.960 



(iv) Factor models 

k=lCi 

0.715 

0.000 

0.000 

0.115 

0.370 

0.455 




k = 20 

0.430 

0.000 

0.000 

0.015 

0.105 

0.225 




6'^ = 0.1 

1.000 

1.000 

1.000 

0.000 

0.000 

0.000 



(v) Group structure 

,52 = 0.05 

1.000 

1.000 

1.000 

0.000 

0.000 

0.000 




^2 = 0.01 

1.000 

1.000 

1.000 

0.000 

0.000 

0.000 



(vi) Extreme correlation 


0.905 

0.000 

0.000 

0.000 

0.150 

0.110 


47 


































Table S.2: Model selection results when (p, n) = (1000,100) for Simulation Study 4 in Section 
4.4 


example 


#FNs 

#FPs 

Coverage(%) 

Exact (%) 

Size 

w-m 

time (sec) 


Lasso 

0.05 

0.78 

95.0 

44.5 

5.73 

2.20 

0.09 


SCAD 

0.00 

0.02 

100.0 

98.0 

5.02 

0.59 

1.37 

(i) Independent 
predictors 

ISIS-SCAD 

0.01 

0.05 

99.0 

94.0 

5.04 

0.59 

11.22 

SIS-SCAD 

0.10 

0.04 

91.5 

89.5 

4.94 

0.90 

0.19 

RRCS-SCAD 

0.16 

0.07 

86.5 

85.5 

4.91 

1.09 

0.54 

5 = 5,P||2 =3.8 

FR-Lasso 

0.01 

0.78 

99.0 

55.0 

5.62 

1.75 

67.87 

FR-SCAD 

0.00 

0.04 

100.0 

96.0 

5.04 

0.59 

68.02 


HOLP-Lasso 

0.09 

0.82 

94.0 

43.5 

5.72 

2.23 

0.06 


HOLP-SCAD 

0.06 

0.04 

95.5 

94.0 

4.98 

0.61 

0.20 


HOLP-EBICS 

0.18 

0.02 

83.5 

83.0 

4.84 

0.93 

0.05 


Tilting 

0.00 

0.05 

100.0 

95.0 

5.05 

0.59 

294.7 


Lasso 

2.64 

2.38 

9.0 

0.0 

4.74 

9.85 

0.12 


SCAD 

0.28 

8.15 

75.5 

1.0 

12.97 

8.33 

5.64 

(ii) Compound 

ISIS-SCAD 

1.52 

5.82 

22.0 

1.5 

9.30 

8.68 

20.15 

SIS-SCAD 

1.59 

4.58 

24.0 

5.5 

7.99 

8.83 

0.58 

symmetry 

RRCS-SCAD 

1.79 

4.78 

18.5 

5.0 

7.99 

9.22 

1.04 

s = 5, ||/3||2 = 8.6 

FR-Lasso 

0.93 

5.43 

50.0 

1.0 

9.50 

8.51 

92.81 

FR-SCAD 

0.74 

6.50 

56.0 

1.0 

10.76 

7.26 

93.24 


HOLP-Lasso 

2.41 

2.48 

12.0 

0.0 

5.07 

9.61 

0.10 


HOLP-SCAD 

0.34 

5.58 

72.5 

3.0 

10.24 

6.84 

0.52 


HOLP-EBICS 

1.06 

2.62 

34.0 

11.0 

6.56 

7.16 

0.19 


Tilting 

3.07 

5.07 

20.0 

0.0 

7.00 

9.82 

238.2 


Lasso 

0.00 

1.12 

100.0 

0.0 

4.12 

0.84 

0.12 


SCAD 

0.00 

0.03 

100.0 

97.5 

3.66 

0.36 

1.31 

(iii) Autoregressive 
correlation 

ISIS-SCAD 

0.00 

0.01 

100.0 

99.0 

3.01 

0.30 

14.27 

SIS-SCAD 

0.00 

0.02 

100.0 

98.5 

3.02 

0.30 

0.16 

RRCS-SCAD 

0.00 

0.01 

100.0 

99.0 

3.01 

0.30 

0.66 

s = 3,||/3||2 = 3.9 

ER-Lasso 

0.00 

1.12 

100.0 

0.0 

4.12 

0.73 

96.01 

ER-SCAD 

0.00 

0.04 

100.0 

96.5 

4.70 

0.46 

96.19 


HOLP-Lasso 

0.00 

1.16 

100.0 

0.0 

4.16 

0.83 

0.06 


HOLP-SCAD 

0.00 

0.01 

100.0 

99.0 

3.01 

0.28 

0.15 


HOLP-EBICS 

0.00 

0.00 

100.0 

100.0 

3.00 

0.28 

0.10 


Tilting 

0.00 

0.01 

100.0 

99.0 

3.01 

0.28 

233.9 


Lasso 

4.37 

4.89 

0.5 

0.0 

5.52 

11.20 

0.13 


SCAD 

0.30 

20.18 

75.0 

0.0 

24.88 

13.29 

3.23 


ISIS-SCAD 

2.64 

14.82 

7.0 

2.0 

17.19 

15.50 

18.80 

(iv) Factor Models 

SIS-SCAD 

3.89 

15.94 

0.5 

0.0 

17.05 

16.58 

0.66 


RRCS-SCAD 

3.86 

16.54 

0.5 

0.0 

17.68 

16.62 

1.03 

s = 5,||/3||2 = 8.6 

ER-Lasso 

4.77 

3.90 

0.5 

0.0 

6.13 

11.92 

95.86 


ER-SCAD 

2.87 

12.08 

16.0 

0.0 

14.21 

15.80 

96.32 


HOLP-Lasso 

3.83 

4.41 

0.5 

0.0 

5.58 

11.20 

0.07 


HOLP-SCAD 

0.81 

11.22 

65.0 

2.5 

15.41 

10.72 

0.67 


HOLP-EBICS 

1.45 

7.29 

29.0 

6.0 

10.84 

11.37 

0.11 


Tilting 

2.98 

3.15 

14.3 

2.0 

5.17 

12.02 

165.8 


Lasso 

9.34 

0.12 

0.0 

0.0 

5.77 

14.08 

0.13 


SCAD 

11.94 

63.76 

0.0 

0.0 

66.82 

26.64 

3.74 

(v) Group 
structure 

ISIS-SCAD 

11.97 

19.20 

0.0 

0.0 

22.24 

22.90 

21.08 

SIS-SCAD 

11.93 

18.33 

0.0 

0.0 

21.39 

22.57 

0.54 

RRCS-SCAD 

11.93 

18.03 

0.0 

0.0 

21.10 

22.65 

0.94 

5 = 5,P||2 = 19.4 

ER-Lasso 

11.51 

0.73 

0.0 

0.0 

4.22 

18.52 

96.64 

ER-SCAD 

11.96 

19.48 

0.0 

0.0 

23.84 

24.84 

97.02 


HOLP-Lasso 

9.29 

0.12 

0.0 

0.0 

5.83 

14.07 

0.11 


HOLP-SCAD 

11.93 

18.32 

0.0 

0.0 

21.38 

22.52 

0.42 


HOLP-EBICS 

11.95 

1.03 

0.0 

0.0 

4.08 

22.32 

0.18 


Tilting 

11.75 

0.37 

0.0 

0.0 

3.62 

22.95 

175.5 


Lasso 

2.35 

8.20 

8.0 

0.0 

10.84 

10.14 

0.12 


SCAD 

0.03 

0.10 

97.5 

92.0 

5.07 

1.93 

3.28 

(vi) Extreme 
correlation 

ISIS-SCAD 

4.84 

3.78 

0.0 

0.0 

3.94 

13.80 

20.00 

SIS-SCAD 

4.98 

2.07 

0.0 

0.0 

2.09 

12.35 

0.89 

RRCS-SCAD 

4.98 

2.06 

0.0 

0.0 

2.08 

12.35 

1.38 

s = 5, ||^||2 = 8.6 

ER-Lasso 

2.89 

6.41 

1.0 

0.0 

8.52 

11.28 

87.24 

ER-SCAD 

3.08 

3.15 

0.5 

0.5 

5.07 

12.41 

87.60 


HOLP-Lasso 

2.26 

8.16 

8.5 

0.0 

10.90 

10.13 

0.09 


HOLP-SCAD 

0.03 

0.08 

97.5 

93.5 

5.05 

1.46 

0.52 


HOLP-EBICS 

0.70 

0.70 

46.0 

46.0 

5.00 

5.91 

0.16 


Tilting 

4.47 

3.17 

0.0 

0.0 

3.70 

12.54 

208.8 
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E: An analysis of the commonly selected genes 


We summarize the commonly selected genes in Table S.3 The listed genes are all selected 
by at least two methods. In particular, gene BE107075 is chosen by all methods other than 


tilting. Breheny and Huang (2013) reported that this gene is also selected via group Lasso 
and group SCAD, and we hnd that by htting a cubic curve, it can explain more than 65% 
of the variance of TRIM32. Interestingly, tilting selects a completely different set of genes, 
and even the submodel after screening is thoroughly different from other screening methods. 
This result may be explained by the strong correlations among genes, as the largest absolute 
correlation is around 0.99 and the median is 0.62. 


Table S.3: Commonly selected genes for different methods 


Probe ID 

1376747 1381902 1390539 1382673 

Gene name 

BF107075 Zfp292 BF285569 BF115812 

Lasso 

SCAD 

yes yes yes 

yes yes yes 

ISIS-SCAD 

SIS-SCAD 

RRCS-SCAD 

yes 

yes yes 

yes yes 

FR-Lasso 

FR-SCAD 

yes yes 

yes yes yes 

HOLP-Lasso 

HOLP-SCAD 

HOLP-FBICS 

yes yes 

yes yes 

yes 

Tilting 
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